Preface



The proceedings of ECML/PKDD 2003 are published in two volumes: the
Proceedings of the 14th European Conference on Machine Learning (LNAI 2837)
and the Proceedings of the 7th European Conference on Principles and Practice 
of Knowledge Discovery in Databases (LNAI 2838). The two conferences were 
held on September 22-26, 2003 in Cavtat, a small tourist town in the vicinity of 
Dubrovnik, Croatia. 

As machine learning and knowledge discovery are two highly related fields, 
the co-location of both conferences is beneficial for both research communities. In 
Cavtat, ECML and PKDD were co-located for the third time in a row, following 
the successful co-location of the two European conferences in Freiburg (2001) 
and Helsinki (2002). The co-location of ECML 2003 and PKDD 2003 resulted in 
a joint program for the two conferences, including paper presentations, invited 
talks, tutorials, and workshops. 

Out of 332 submitted papers, 40 were accepted for publication in the 
ECML 2003 proceedings, and 40 were accepted for publication in the PKDD 2003 
proceedings. All the submitted papers were reviewed by three referees. In addition
to submitted papers, the conference program consisted of four invited talks,
four tutorials, seven workshops, two tutorials combined with a workshop, and a 
discovery challenge. 

We wish to express our gratitude to 

— the authors of submitted papers, 

— the program committee members, for thorough and timely paper evaluation, 

— invited speakers Pieter Adriaans, Leo Breiman, Christos Faloutsos, and 
Donald B. Rubin, 

— tutorial and workshop chairs Stefan Kramer, Luis Torgo, and Luc Dehaspe, 

— local and technical organization committee members, 

— advisory board members Luc De Raedt, Tapio Elomaa, Peter Flach, Heikki 
Mannila, Arno Siebes, and Hannu Toivonen, 

— awards and grants committee members Dunja Mladenic, Rob Holte, and 
Michael May, 

— Richard van de Stadt for the development of CyberChair, which was used
to support the paper submission and evaluation process,

— and Alfred Hofmann of Springer-Verlag for co-operation in publishing the
proceedings.

Finally, we gratefully acknowledge the financial support of the Croatian Ministry
of Science and Technology, Slovenian Ministry of Education, Science, and 
Sports, and the Knowledge Discovery Network of Excellence (KDNet). 
KDNet also sponsored the student grants and best paper awards, while 
Kluwer Academic Publishers (the Machine Learning Journal) awarded a 
prize for the best student paper. 
We hope and trust that the week in Cavtat in late September 2003 will be 
remembered as a fruitful, challenging, and enjoyable scientific and social event. 



June 2003 


Nada Lavrac 
Dragan Gamberger 



Hendrik Blockeel 
Ljupco Todorovski 



ECML/PKDD 2003 Organization 



Executive Committee 

Program Chairs:
  Nada Lavrac (Jozef Stefan Institute, Slovenia), ECML and PKDD chair
  Dragan Gamberger (Rudjer Boskovic Institute, Croatia), ECML and PKDD co-chair
  Hendrik Blockeel (Katholieke Universiteit Leuven, Belgium), ECML co-chair
  Ljupco Todorovski (Jozef Stefan Institute, Slovenia), PKDD co-chair

Tutorial and Workshop Chair:
  Stefan Kramer (Technische Universität München, Germany)

Workshop Co-chair:
  Luis Torgo (University of Porto, Portugal)

Tutorial Co-chair:
  Luc Dehaspe (PharmaDM, Belgium)

Challenge Chair:
  Petr Berka (University of Economics, Prague, Czech Republic)

Advisory Board:
  Luc De Raedt (Albert-Ludwigs University Freiburg, Germany)
  Tapio Elomaa (University of Helsinki, Finland)
  Peter Flach (University of Bristol, UK)
  Heikki Mannila (Helsinki Institute for Information Technology, Finland)
  Arno Siebes (Utrecht University, The Netherlands)
  Hannu Toivonen (University of Helsinki, Finland)

Awards and Grants Committee:
  Dunja Mladenic (Jozef Stefan Institute, Slovenia)
  Rob Holte (University of Alberta, Canada)
  Michael May (Fraunhofer AIS, Germany)
  Hendrik Blockeel (Katholieke Universiteit Leuven, Belgium)

Local Chairs:
  Dragan Gamberger, Tomislav Smuc (Rudjer Boskovic Institute)

Organization Committee:
  Darek Krzywania, Celine Vens, Jan Struyf (Katholieke Universiteit Leuven, Belgium)
  Damjan Demsar, Branko Kavsek, Milica Bauer, Bernard Zenko, Peter Ljubic (Jozef Stefan Institute, Slovenia)
  Mima Benat (Rudjer Boskovic Institute)
  Dalibor Ivusic (The Polytechnic of Dubrovnik, Croatia)
  Zdenko Sonicki (University of Zagreb, Croatia)






ECML 2003 Program Committee 



H. Blockeel, Belgium 

A. van den Bosch, The Netherlands 

H. Bostrom, Sweden 

I. Bratko, Slovenia 
P. Brazdil, Portugal 
W. Buntine, Finland 

M. Craven, USA 

N. Cristianini, USA 

J. Cussens, UK 

W. Daelemans, Belgium 

L. Dehaspe, Belgium 
L. De Raedt, Germany 

S. Dzeroski, Slovenia 

T. Elomaa, Finland 

F. Esposito, Italy 

B. Filipic, Slovenia 
P. Flach, UK 

J. Fürnkranz, Austria
J. Gama, Portugal 
D. Gamberger, Croatia 
J.-G. Ganascia, France 

L. Getoor, USA 

H. Hirsh, USA 

T. Hofmann, USA 
T. Horvath, Germany 
T. Joachims, USA 
D. Kazakov, UK 

R. Khardon, USA 
Y. Kodratoff, France 

I. Kononenko, Slovenia 

S. Kramer, Germany 

M. Kubat, USA 
S. Kwek, USA 

N. Lavrac, Slovenia 

C. Ling, Canada 

R. Lopez de Mantaras, Spain 



D. Malerba, Italy 
H. Mannila, Finland 

S. Matwin, Canada 

J. del R. Millan, Switzerland 
D. Mladenic, Slovenia 

K. Morik, Germany 
H. Motoda, Japan 
R. Nock, France 

D. Page, USA 

G. Paliouras, Greece 

B. Pfahringer, New Zealand 

E. Plaza, Spain 

J. Rousu, Finland 

C. Rouveirol, France 

L. Saitta, Italy 

T. Scheffer, Germany 

M. Sebag, France 

J. Shawe-Taylor, UK 

A. Siebes, The Netherlands 

D. Sleeman, UK 

R. H. Sloan, USA 

M. van Someren, The Netherlands 
P. Stone, USA 
J. Suykens, Belgium 

H. Tirri, Finland 

L. Todorovski, Slovenia 

L. Torgo, Portugal 
P. Turney, Canada 

P. Vitanyi, The Netherlands 

S. M. Weiss, USA 
G. Widmer, Austria 

M. Wiering, The Netherlands 

R. Wirth, Germany 

S. Wrobel, Germany 

T. Zeugmann, Germany 

B. Zupan, Slovenia 






PKDD 2003 Program Committee 



H. Ahonen-Myka, Finland 

E. Baralis, Italy 

R. Bellazzi, Italy 
M.R. Berthold, USA 
H. Blockeel, Belgium 
M. Bohanec, Slovenia 

J.F. Boulicaut, France 

B. Cremilleux, France 
L. Dehaspe, Belgium 

L. De Raedt, Germany 

S. Dzeroski, Slovenia 

T. Elomaa, Finland 

M. Ester, Canada 

A. Feelders, The Netherlands 
R. Feldman, Israel 
P. Flach, UK 

E. Frank, New Zealand 
A. Freitas, UK 

J. Fürnkranz, Austria
D. Gamberger, Croatia 

F. Giannotti, Italy 

C. Giraud-Carrier, Switzerland 

M. Grobelnik, Slovenia 
H.J. Hamilton, Canada 
J. Han, USA 

R. Hilderman, Canada 
H. Hirsh, USA 

S. J. Hong, USA 

F. Hoppner, Germany 
S. Kaski, Finland 
J.-U. Kietz, Switzerland 

R. D. King, UK 

W. Kloesgen, Germany 
Y. Kodratoff, France 
J.N. Kok, The Netherlands 

S. Kramer, Germany 

N. Lavrac, Slovenia 

G. Manco, Italy 



H. Mannila, Finland 
S. Matwin, Canada 
M. May, Germany 

D. Mladenic, Slovenia 
S. Morishita, Japan 
H. Motoda, Japan 

G. Nakhaeizadeh, Germany 

C. Nedellec, France 

D. Page, USA 
Z.W. Ras, USA 

J. Rauch, Czech Republic 

G. Ritschard, Switzerland 
M. Sebag, France 

F. Sebastiani, Italy
M. Sebban, France 

A. Siebes, The Netherlands 

A. Skowron, Poland 

M. van Someren, The Netherlands 

M. Spiliopoulou, Germany 

N. Spyratos, France 

R. Stolle, USA 

E. Suzuki, Japan 
A. Tan, Singapore 

L. Todorovski, Slovenia 

H. Toivonen, Finland 
L. Torgo, Portugal 

S. Tsumoto, Japan 

A. Unwin, Germany 

K. Wang, Canada 

L. Wehenkel, Belgium 

D. Wettschereck, Germany 

G. Widmer, Austria 

R. Wirth, Germany 

S. Wrobel, Germany 

M. J. Zaki, USA 
D.A. Zighed, France 

B. Zupan, Slovenia 






ECML/PKDD 2003 Additional Reviewers 



F. Aiolli 


L. Geng 


H.S. Nguyen 


A. Amrani 


A. Giacometti 


S. Nijssen 


A. Appice 


T. Giorgino 


A. Nowe 


E. Armengol 


B. Goethals 


M. Ohtani 


I. Autio 


M. Grabert 


S. Ontanon 


J. Aze 


E. Gyftodimos 


R. Ortale 


I. Azzini 


W. Hamalainen 


M. Ould Abdel Vetah


M. Baglioni 


A. Habrard 


G. Paaß


A. Banerjee 


M. Hall 


I. Palmisano 


T.M.A. Basile 


S. Hoche 


J. Peltonen 


M. Bendou 


E. Hüllermeier


L. Pena 


M. Berardi 


L. Jacobs 


D. Pedreschi 


G. Beslon 


A. Jakulin 


G. Petasis 


M. Bevk 


T.Y. Jen 


J. Petrak 


A. Blumenstock 


B. Jeudy 


V. Phan Luong 


D. Bojadziev 


A. Jorge 


D. Pierrakos 


M. Borth 


R.J. Jun 


U. Rückert


J. Brank 


P. Juvan 


S. Rüping


P. Brockhausen 


M. Kaariainen 


J. Ramon 


M. Ceci 


K. Karimi 


S. Ray 


E. Cesario 


K. Kersting 


C. Rigotti 


S. Chiusano 


J. Kindermann 


F. Rioult 


J. Clech 


S. Kiritchenko 


M. Robnik-Sikonja 


A. Cornuejols 


W. Kosters 


M. Roche 


J. Costa 


I. Koychev 


B. Rosenfeld 


T. Curk 


M. Kukar 


A. Sadikov 


M. Degemmis 


S. Lallich 


T. Saito 


D. Demsar 


C. Larizza 


E. Savia 


J. Demsar 


D. Laurent 


C. Savu-Krohn 


M. Denecker 


G. Leban 


G. Schmidberger 


N. Di Mauro 


S.D. Lee 


M. Scholz 


K. Driessens 


G. Legrand 


A.K. Seewald 


T. Erjavec 


E. Leopold 


J. Sese 


T. Euler 


J. Leskovec 


G. Sigletos 


N. Fanizzi 


O. Licchelli


T. Silander 


S. Ferilli 


J.T. Lindgren 


D. Slezak 


M. Fernandes 


F.A. Lisi 


C. Soares 


D. Finton 


T. Malinen 


D. Sonntag 


S. Flesca 


O. Matte-Tailliez


H.-M. Suchier 


J. Franke 


A. Mazzanti 


B. Sudha 


F. Furfaro 


P. Medas 


P. Synak 


T. Gartner 


R. Meo 


A. Tagarelli 


P. Garza 


T. Mielikainen 


Y. Tzitzikas 






R. Vilalta 
D. Vladusic 
X. Wang 
A. Wojna 
J. Wroblewski 



M. Wurst 
R.J. Yan 
X. Yan 
H. Yao 
X. Yin 



I. Zogalis 
W. Zou 
M. Znidarsic 
B. Zenko 






ECML/PKDD 2003 Tutorials 

KD Standards 

Sarabjot S. Anand, Marko Grobelnik, and Dietrich Wettschereck 

Data Mining and Machine Learning in Time Series Databases 
Eamonn Keogh 

Exploratory Analysis of Spatial Data and Decision Making Using Interactive 
Maps and Linked Dynamic Displays 
Natalia Andrienko and Gennady Andrienko 

Music Data Mining 
Darrell Conklin

ECML/PKDD 2003 Workshops 

First European Web Mining Forum 

Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra 
Spiliopoulou, and Gerd Stumme 

Multimedia Discovery and Mining 
Dunja Mladenic and Gerhard Paaß

Data Mining and Text Mining in Bioinformatics 
Tobias Scheffer and Ulf Leser

Knowledge Discovery in Inductive Databases 

Jean-François Boulicaut, Saso Dzeroski, Mika Klemettinen, Rosa Meo, and Luc
De Raedt 

Graph, Tree, and Sequence Mining 
Luc De Raedt and Takashi Washio 

Probabilistic Graphical Models for Classification

Pedro Larrañaga, Jose A. Lozano, Jose M. Peña, and Iñaki Inza

Parallel and Distributed Computing for Machine Learning
Rui Camacho and Ashwin Srinivasan

Discovery Challenge: A Collaborative Effort in Knowledge Discovery
from Databases 

Petr Berka, Jan Rauch, and Shusaku Tsumoto 



ECML/PKDD 2003 Joint Tutorials-Workshops

Learning Context-Free Grammars

Colin de la Higuera, Jose Oncina, Pieter Adriaans, Menno van Zaanen

Adaptive Text Extraction and Mining
Fabio Ciravegna, Nicholas Kushmerick



Table of Contents 



Invited Papers

From Knowledge-Based to Skill-Based Systems: Sailing as a Machine Learning Challenge
Pieter Adriaans

Two-Eyed Algorithms and Problems
Leo Breiman

Next Generation Data Mining Tools: Power Laws and Self-similarity for Graphs, Streams and Traditional Data
Christos Faloutsos

Taking Causality Seriously: Propensity Score Methodology Applied to Estimate the Effects of Marketing Interventions
Donald B. Rubin

Contributed Papers

Support Vector Machines with Example Dependent Costs
Ulf Brefeld, Peter Geibel, and Fritz Wysotzki

Abalearn: A Risk-Sensitive Approach to Self-play Learning in Abalone
Pedro Campos and Thibault Langlois

Life Cycle Modeling of News Events Using Aging Theory
Chien Chin Chen, Yao-Tsung Chen, Yeali Sun, and Meng Chang Chen

Unambiguous Automata Inference by Means of State-Merging Methods
François Coste and Daniel Fredouille

Could Active Perception Aid Navigation of Partially Observable Grid Worlds?
Paul A. Crook and Gillian Hayes

Combined Optimization of Feature Selection and Algorithm Parameters in Machine Learning of Language
Walter Daelemans, Veronique Hoste, Fien De Meulder, and Bart Naudts

Iteratively Extending Time Horizon Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel

Volume under the ROC Surface for Multi-class Problems
Cesar Ferri, Jose Hernandez-Orallo, and Miguel Angel Salido

Improving the AUC of Probabilistic Estimation Trees
Cesar Ferri, Peter A. Flach, and Jose Hernandez-Orallo

Scaled CGEM: A Fast Accelerated EM
Jorg Fischer and Kristian Kersting

Pairwise Preference Learning and Ranking
Johannes Fürnkranz and Eyke Hüllermeier

A New Way to Introduce Knowledge into Reinforcement Learning
Pascal Garcia

Improvement of the State Merging Rule on Noisy Data in Probabilistic Grammatical Inference
Amaury Habrard, Marc Bernard, and Marc Sebban

COllective INtelligence with Sequences of Actions - Coordinating Actions in Multi-agent Systems
Pieter Jan ’t Hoen and Sander M. Bohte

Rademacher Penalization over Decision Tree Prunings
Matti Kääriäinen and Tapio Elomaa

Learning Rules to Improve a Machine Translation System
David Kauchak and Charles Elkan

Optimising Performance of Competing Search Engines in Heterogeneous Web Environments
Rinat Khoussainov and Nicholas Kushmerick

Robust k-DNF Learning via Inductive Belief Merging
Frederic Koriche and Joel Quinqueton

Logistic Model Trees
Niels Landwehr, Mark Hall, and Eibe Frank

Color Image Segmentation: Kernel Do the Feature Space
Jianguo Lee, Jingdong Wang, and Changshui Zhang

Evaluation of Topographic Clustering and Its Kernelization
Marie-Jeanne Lesot, Florence d’Alche-Buc, and Georges Siolas

A New Pairwise Ensemble Approach for Text Classification
Yan Liu, Jaime Carbonell, and Rong Jin

Self-evaluated Learning Agent in Multiple State Games
Koichi Moriyama and Masayuki Numao

Classification Approach towards Ranking and Sorting Problems
Shyamsundar Rajaram, Ashutosh Garg, Xiang Sean Zhou, and Thomas S. Huang

Using MDP Characteristics to Guide Exploration in Reinforcement Learning
Bohdana Ratitch and Doina Precup

Experiments with Cost-Sensitive Feature Evaluation
Marko Robnik-Sikonja

A Markov Network Based Factorized Distribution Algorithm for Optimization
Roberto Santana

On Boosting Improvement: Error Reduction and Convergence Speed-Up
Marc Sebban and Henri-Maxime Suchier

Improving SVM Text Classification Performance through Threshold Adjustment
James G. Shanahan and Norbert Roma

Backoff Parameter Estimation for the DOP Model
Khalil Sima’an and Luciano Buratto

Improving Numerical Prediction with Qualitative Constraints
Dorian Suc and Ivan Bratko

A Generative Model for Semantic Role Labeling
Cynthia A. Thompson, Roger Levy, and Christopher D. Manning

Optimizing Local Probability Models for Statistical Parsing
Kristina Toutanova, Mark Mitchell, and Christopher D. Manning

Extended Replicator Dynamics as a Key to Reinforcement Learning in Multi-agent Systems
Karl Tuyls, Dries Heytens, Ann Nowe, and Bernard Manderick

Visualizations for Assessing Convergence and Mixing of MCMC
Jarkko Venna, Samuel Kaski, and Jaakko Peltonen

A Decomposition of Classes via Clustering to Explain and Improve Naive Bayes
Ricardo Vilalta and Irina Rish

Improving Rocchio with Weakly Supervised Clustering
Romain Vinot and François Yvon

A Two-Level Learning Method for Generalized Multi-instance Problems
Nils Weidmann, Eibe Frank, and Bernhard Pfahringer

Clustering in Knowledge Embedded Space
Yungang Zhang, Changshui Zhang, and Shijun Wang

Ensembles of Multi-instance Learners
Zhi-Hua Zhou and Min-Ling Zhang

Author Index



From Knowledge-Based to Skill-Based Systems: 
Sailing as a Machine Learning Challenge 
Abstract. This paper describes the Robosail project. It started in 1997
with the aim to build a self-learning autopilot for a single-handed sailing
yacht. The goal was to make an adaptive system that would help a
single-handed sailor to go faster on average in a race. Presently, after five years
of development and a number of sea trials, we have a commercial system
available (www.robosail.com). It is a hybrid system using agent technology,
machine learning, data mining and rule-based reasoning. Apart
from describing the system we try to generalize our findings, and argue 
that sailing is an interesting paradigm for a class of hybrid systems that 
one could call Skill-based Systems. 



1 Introduction 

Sailing is a difficult sport that requires a lot of training and expert knowledge
[1], [9], [6]. Recently the co-operation of crews on a boat has been studied in the
domain of cognitive psychology [4]. In this paper we describe the Robosail system
that aims at the development of self-learning steering systems for racing yachts
[8]. We defend the view that this task is an example of what one could call
skill-based systems. The connection between verbal reports of experts performing
a certain task and the implementation of ML for that task is an interesting
emerging research domain [3], [2], [7]. The system was tested in several real-life
race events and is currently commercially available. 



2 The Task 

Modern single-handed sailing started its history with the organization of the 
first Observer Single-Handed Transatlantic Race (OSTAR) in 1960. Since that 
time the sport has known a tremendous development and is the source of many 
innovations in sailing. A single-handed skipper can only attend the helm for 
about 20% of his time. The rest is divided between boat-handling, navigation, 
preparing meals, doing repairs and sleeping. All single-handed races allow the
skippers to use some kind of autopilot. In its simplest form such an autopilot is 
attached to a flux-gate compass and it can only maintain a compass course. More 
sophisticated autopilots use a variety of sensors (wind, heel, global positioning 
system etc.) to steer the boat optimally. In all races the use of engines to propel 
the boat and of electrical winches to operate the sails is forbidden. All boat- 
handling except steering is to be done with manual power only. 

It is clear that a single-handed sailor will be less efficient than a full crew. 
Given the fact that a single-handed yacht operates on autopilot for more than
80% of the time, a slightly more efficient autopilot would already make a yacht more
competitive. In a transatlantic crossing a skipper will alter course maybe once or
twice a day based on meteorological data and information from various other
sources like the positions of the competitors. From an economic point of view
the automation of this task is not a top priority. It is the optimization of the
handling of the helm from second to second that offers the biggest opportunity 
for improvement. The task to be optimized is then: steer the ship as fast as 
possible in a certain direction and give the skipper optimal support in terms of 
advice on boat-handling, early warnings, alerts etc. 

3 The Approach

Our initial approach to the limited task of maintaining the course of a vessel 
was to conceive it as a pure machine learning task. At any given moment the 
boat would be in a certain region of a complex state-space defined by the array 
of sensor inputs. There was a limited set of actions defined in terms of a force 
exercised on the rudder, and there was a reward defined in terms of the overall 
speed of the boat. Fairly soon it became clear that it was not possible to solve 
the problem in terms of simple optimization of a system in a state-space: 

— There is no neutral theory-free description of the system. A sailing yacht is a
system that exists on the border between two media with strong non-linear
behavior, wind and water. The interaction between these media and the
boat should ideally be modelled in terms of complex differential equations.
A finite set of sensors will never be able to give enough information to analyze
the system in all of its relevant aspects. A careful selection of sensors given
economical, energy management and other practical constraints is necessary.
In order to make this selection one needs a theory about what to measure.

— Furthermore, given the complexity of the mathematical description, there is 
no guarantee that the system will know regions of relative stability in which 
it can be controlled efficiently. The only indication we have that efficient 
control is possible is the fact that human experts do the task well, and the
best guess as to which sensors to select is the informal judgement of experts on
the sort of information they need to perform the task. The array of sensors
that ‘describes’ the system is in essence already anthropomorphic. 

— Establishing the correct granularity of the measurements is a problem. Wind
and wave information typically comes with a frequency of at least 10 Hz.
But hidden in these signals are other concepts that exist only on a different
timescale, e.g. gusts (above 10 seconds), veering (10 minutes) and sea state
(hours). A careful analysis of sensor information involved in sailing shows
that sensors and the concepts that can be measured with them cluster in
different time-frames (hundreds of seconds, minutes, hours). This is a strong
indication for a modular architecture. The fact that at each level decisions
of a different nature have to be taken strongly suggests an architecture that
consists of a hierarchy of agents that operate in different time-frames: lower
agents have a higher measurement granularity, higher agents a lower one.

— Even when a careful selection of sensors is made and an adequate agent
architecture is in place, the convergence of the learning algorithms is a problem.
Tabula rasa learning is impossible in the context of sailing. One has to
start with a rough rule-based system that operates the boat reasonably well
and use ML techniques to optimize the system.
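
The point about granularity can be illustrated with a small sketch: a single 10 Hz wind signal is summarized over several window lengths, so that agents working at different time scales each see their own view of it. This is only an illustration under stated assumptions. The class `RollingMean`, the helper names, and the exact window lengths are hypothetical; the text only names the rough time scales (seconds for gusts, minutes for veering, hours for sea state).

```python
from collections import deque


class RollingMean:
    """Mean of the samples seen in the last `window_s` seconds."""

    def __init__(self, window_s: float, sample_hz: float):
        # Keep only as many samples as fit in the window.
        self.samples = deque(maxlen=max(1, round(window_s * sample_hz)))

    def add(self, value: float) -> float:
        self.samples.append(value)
        # O(window) per update; adequate for a sketch, not for production.
        return sum(self.samples) / len(self.samples)


def make_views(sample_hz: float = 10.0):
    """One view per time scale; the window lengths are assumptions."""
    return {
        "gust": RollingMean(10.0, sample_hz),         # about 10 seconds
        "veering": RollingMean(600.0, sample_hz),     # about 10 minutes
        "sea_state": RollingMean(3600.0, sample_hz),  # about 1 hour
    }


def feed(views, value: float):
    """Push one raw sample into every view and return the current means."""
    return {name: view.add(value) for name, view in views.items()}
```

A sudden rise in wind speed shows up almost immediately in the short window and only gradually in the longer ones, which is why decisions of a different nature can be attached to each level.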

In the end we developed a hybrid agent-based system. It merges traditional AI
techniques like rule-based reasoning with more recent methods developed in the
ML community. Essential for this kind of system is the link between expert
concepts that have a fuzzy nature and learning algorithms. A simple example of
an expert rule in the Robosail system is: If you sail close-hauled then luff in a
gust. This rule contains the concepts ‘close-hauled’, ‘gust’ and ‘luff’. The system
contains agents that represent these concepts:

— Course agent: If the apparent wind angle is between A and B then you sail 
close hauled 

— Gust agent: If the average apparent wind increases by a factor D for more than
E seconds then there is a gust

— Luff agent: Steer Z degrees windward. 

The related learning methodology then is: 

— Task: Learn optimal values for A,B,C,D,E,Z 

— Start with expert estimates then 

— Optimize using ML techniques 

This form of symbol grounding is an emerging area of research interest that seems
to be of vital importance to skill-based systems like Robosail [8], [2],
[7], [5].
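
The rule agents and the learning methodology above can be sketched as follows. This is a minimal illustration, not the Robosail implementation: the names `RuleParams`, `luff_command` and `tune` are hypothetical, the `score` function stands in for whatever performance measure is available (for example average boat speed over logged or simulated runs), and simple hill climbing is swapped in for the unspecified ML techniques. The parameter C is listed in the task but does not appear in the rules shown, so it is left out here.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuleParams:
    """Tunable thresholds of the expert rule (letters follow the text)."""
    a: float  # lower bound of the close-hauled apparent wind angle, degrees
    b: float  # upper bound of the close-hauled apparent wind angle, degrees
    d: float  # factor by which the wind must exceed its baseline in a gust
    e: float  # seconds the increase must persist
    z: float  # degrees to steer windward when luffing


def is_close_hauled(p: RuleParams, wind_angle: float) -> bool:
    return p.a <= wind_angle <= p.b


def is_gust(p: RuleParams, wind_speeds, baseline: float, dt: float) -> bool:
    """True if the latest samples exceed d * baseline for roughly e seconds."""
    needed = max(1, round(p.e / dt))
    recent = wind_speeds[-needed:]
    return len(recent) == needed and all(w > p.d * baseline for w in recent)


def luff_command(p, wind_angle, wind_speeds, baseline, dt=0.1) -> float:
    """Degrees to steer windward: z if the rule fires, otherwise 0."""
    if is_close_hauled(p, wind_angle) and is_gust(p, wind_speeds, baseline, dt):
        return p.z
    return 0.0


def tune(params: RuleParams, score, steps=200, scale=0.05, seed=0) -> RuleParams:
    """Start from expert estimates, perturb one parameter at a time,
    and keep a change only if the (assumed) score improves."""
    rng = random.Random(seed)
    best, best_score = params, score(params)
    for _ in range(steps):
        name = rng.choice(["a", "b", "d", "e", "z"])
        value = getattr(best, name) * (1 + rng.uniform(-scale, scale))
        candidate = replace(best, **{name: value})
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

Starting from the expert estimates keeps the search in a region where the boat is already sailed reasonably well, which is the role the text assigns to the rough rule-based system.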

4 The System 

The main system contains four agents: Skipper, Navigator, Watchman and
Helmsman. These roles are more or less modelled after the task division on
a modern racing yacht [9]. Each agent lives in a different time frame, and the agents
are ordered in a subsumption hierarchy. The skipper is intended to take strategic
decisions with a time interval of say 3 to 6 hours. He has to take into account
weather patterns, currents, sea state, chart info etc. Currently this process is only
partly automated. It results in the determination of a waypoint, i.e. a location
on the map where we want to arrive as soon as possible. The navigator and the
watchman have the responsibility to get to the waypoint. The navigator deals
with the more tactical aspects of this process. He knows the so-called polar
diagrams of the boat and its behavior in various sea states. He also has a number
of agents at his disposal that help him to assess the state of the ship: do we carry
too much sail, is there too much current, is our trim correct etc. The reasoning
of the navigator results in a compass course. This course could change within 
minutes. The watchman is responsible for keeping this course with optimal ve- 
locity in the constantly changing environment (waves, wind shifts etc.). He gives 
commands to the helmsman, whose only responsibility it is to make and execute 
plans to get and keep the rudder in certain positions in time. 
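
The idea of agents living in different time frames can be sketched as a multi-rate loop in which each agent is re-run only when its own period has elapsed and passes its goal down through shared state. This is a simplified illustration and not a full subsumption architecture. The `Agent` class, the placeholder decision functions, and all periods except the skipper's (taken from the lower end of the 3 to 6 hours mentioned above) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    name: str
    period_s: float                 # how often the agent reconsiders its decision
    decide: Callable[[Dict], Dict]  # reads the shared state, returns updates
    last_run: float = float("-inf")


def step(agents: List[Agent], state: Dict, now: float) -> List[str]:
    """Run every agent whose period has elapsed, slowest level first,
    and return the names of the agents that ran."""
    ran = []
    for agent in agents:
        if now - agent.last_run >= agent.period_s:
            state.update(agent.decide(state))
            agent.last_run = now
            ran.append(agent.name)
    return ran


def build_hierarchy() -> List[Agent]:
    """Placeholder decisions; the real agents use charts, polars, sensors."""
    return [
        Agent("skipper", 3 * 3600.0, lambda s: {"waypoint": (42.58, 18.22)}),
        Agent("navigator", 60.0, lambda s: {"compass_course": 270.0}),
        Agent("watchman", 1.0, lambda s: {"rudder_target": 2.0}),
        Agent("helmsman", 0.1,
              lambda s: {"rudder_force": 0.1 * s.get("rudder_target", 0.0)}),
    ]
```

Calling `step` at 10 Hz runs the helmsman on every tick, the watchman once a second, and the higher agents far less often, mirroring the ordering of goals from waypoint to compass course to rudder action.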

Fig. 1. The hierarchy of main agents (the AI solution: a hybrid agent-based approach).
Each level maps an input to an output, from the top of the hierarchy to the bottom:

Level 1
  Input: Weather maps, electronic charts, tidal info
  Output: Goal: Waypoint
Level 2
  Input: Goal: Waypoint. Info: COG, position, tidal info, polars, variation, deviation
  Output: Goal: Compass course. Info: VMG or course
Level 3
  Input: Goal: Compass course. Info: apparent wind speed/angle, heel, speed, sail trim, polar
  Output: Action: Rudder(Delta, Time). Info: trim, sea state
Level 4
  Input: Action: Rudder(Delta, Time). Info: rudder angle, speed, trim, sea state, heel
  Output: Direction(L,R), Force [0,1]



There are a number of core variables: log speed, apparent wind speed and 
angle, rudder angle, compass course, current position, course on ground and 
speed on ground. These are loaded into the kernel system. Apart from these core 
variables there are a number of other sensors that give information, amongst
others: canting angle of the mast, swivel angle of the keel, heel sideways, heel
fore-aft, depth, sea state, wave direction, acceleration in various directions. Some
will activate agents that warn of certain undesirable situations (e.g. depth,
temperature of the water). Others are for the moment only used for human inspection
(e.g. radar images). For each sensor we have to consider the balance between the
contribution to speed and safety of the boat and the negative aspects like energy 
consumption, weight, increased complexity of the system. 
Fig. 2. The main architecture



The final system is a complex interplay between sensor-, agent- and network 
technology, machine learning and AI techniques brought together in a hybrid 
architecture. The hardware (CE radiation level requirements, water and shock 
proof, easy to mount and maintain) consists of: 

— CAN bus architecture: Guaranteed delivery 

— Odys Intelligent rudder control unit (IRCU): 20 kHz, max. 100 Amp (extensive
functions for self-diagnosis)

— Thetys Solid state digital motion sensor and compass 

— Multifunction display 

— Standard third party sensors with NMEA interface (e.g. B&G) 

The software functionality involves: 

— Agent based architecture 

— Subsumption architecture 

— Model builder: on line visual programming 

— Real Time flow charting 

— Relational database with third party datamining facility 

— Web enabling 

— Remote control and reporting 

Machine Learning and AI techniques that are used: 

— Watchman: Case Based Reasoning

— Helmsman: neural network on top of PID controller

— Advisor: nearest-neighbor search

— Agents and virtual sensors for symbol grounding

— Data-explorer with machine learning suite 

— Waverider: 30 dimensional ARMA model 

— Off line KDD effort: rule induction on the basis of fuzzy expert concepts 
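
As an illustration of the lowest layer named above, the following is a sketch of a textbook PID controller that turns a heading error into a bounded rudder command. The paper gives no gains, sample rates, or details of the neural network placed on top of the controller, so the class `PID`, the helper `heading_error`, and any gain values are assumptions, and the neural network layer is omitted.

```python
from dataclasses import dataclass


def heading_error(target_deg: float, actual_deg: float) -> float:
    """Smallest signed difference target - actual, in the range [-180, 180)."""
    return (target_deg - actual_deg + 180.0) % 360.0 - 180.0


@dataclass
class PID:
    kp: float              # proportional gain (assumed value, to be tuned)
    ki: float              # integral gain
    kd: float              # derivative gain
    limit: float           # maximum absolute rudder command
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float) -> float:
        """Return the rudder command for the current error, clamped to +/- limit."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(-self.limit, min(self.limit, out))
```

A learning component on top of such a controller could, for example, adjust the gains or add a correction to the output. How Robosail does this is not described in the text.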

Several prototypes of the Robosail system have been tested over the years: the
first version in the Single-Handed Transatlantic in 2000; a second prototype was
evaluated on board the Kingfisher during a trip from Brazil to the UK. A final
evaluation was done on board the Syllogic Sailing Lab during the Dual Round
Britain and Ireland in 2002. In 2003 the first commercial version is available.

5 Lessons Learned 

The Robosail application is a hybrid system that can be placed somewhere be- 
tween pure rule-based systems and pure machine learning systems. The nature of 
these systems raises some interesting philosophical issues concerning the nature 
of rules and their linguistic representations. In the course of history people have 
discovered that certain systems can be built and controlled, without really un- 
derstanding why this is the case. A sailing boat is such a system. It is what it is 
because of ill-understood hydro- and aerodynamical principles and has a certain 
form because the human body has to interact with it. It is a thoroughly
anthropomorphic machine. Human beings can handle these systems, because they
are the result of a long evolutionary process. Their senses are adapted to those 
regions of reality that are relatively stable and are sensitive to exactly those 
phase changes that give relevant information about the state of the systems. 
In a process of co-evolution the language to communicate about these concepts 
emerged. Specific concepts like ‘wave’, ‘gust’ and ‘veering’ exist because they 
mark relevant changes of the system. Their cognitive status however is complex, 
and it appears to be non-trivial to develop automated systems that discover 
these concepts on the basis of sensor data. 

A deeper discussion of these issues would have to incorporate an analysis of 
the nature of rules that is beyond the scope of this paper. The rules of a game 
like chess exist independently of their verbal representation. We use the verbal 
representation to communicate with others about the game and to train young 
players. A useful distinction is the one between constitutive rules and regulative 
rules. The constitutive rules define the game. If they are broken the game stops. 
An example for chess would be: You may not move a piece to a square already 
occupied by one of your own pieces. Regulative rules define good strategies for 
the game. If you break them you diminish your chances of winning, but the 
game does not stop. An example of a regulative rule for chess would be: When 
you are considering giving up some of your pieces for some of your opponent’s, 
you should think about the values of the men, and not just how many each player 
possesses. Regulative rules represent the experience of expert players. They have
a certain fuzziness and it is difficult to implement them in pure knowledge-based
systems. The only way we can communicate about skills is in terms of regulative
rules. The rule If you sail close-hauled then luff in a gust is an example. Verbal
reports of experts in terms of regulative rules can play an important role in the
design of systems. From a formal point of view they reduce the complexity of the 
task. They tell us where to look in the state space of the system. From a cognitive 
point of view they play a similar role in teaching skills. They tell the student 
roughly what to do. The fine tuning of the skill is then a matter of training. 
Fig. 3. A taxonomy of systems



This discussion suggests that we can classify tasks in two dimensions: 1) The 
expert dimension: do human agents perform well on the task and can they 
report verbally on their actions? 2) The formal dimension: do we have adequate 
formal models of the task that allow us to perform tests in silico? For chess and 
a number of other tasks that were analyzed in the early stages of AI research the 
answer to both questions is yes. Operations research studies systems for which 
the first answer is no and the second answer is yes. For sailing the answer to 
the first question is positive, the answer to the second question negative. This 
is typical for skill-based systems. This situation has a number of interesting 
methodological consequences: we need to incorporate the knowledge of human 
experts into our system, but this knowledge in itself is fundamentally incomplete 
and needs to be embedded in an adaptive environment. Naturally this leads to 
issues concerning symbol grounding, modelling human judgements, hybrid ar- 
chitectures and many other fundamental questions relevant for the construction 
of ML applications in this domain. 

A simple sketch of a methodology to develop skill-based systems would be: 

— Select sensor type and range based on expert input 

— Develop partial model based on expert terminology 

— Create agents that emulate expert judgements 

— Refine model using machine learning techniques 

— Evaluate model with expert 



6 Conclusion and Further Research 

In this paper we have sketched our experiences creating an integrated system 
for steering a sailing yacht. The value of such practical projects can hardly be 
overestimated. Building real life systems is 80% engineering and 20% science. 
One of the insights we developed is the notion of the existence of a special class 
of skill-based systems. Issues in constructing these systems are: the need for a 
hybrid architecture, the interplay between discursive rules (expert system, rule 
induction) and sensorimotor skills (PID controllers, neural networks), a learning 
approach, agent technology, the importance of semantics and symbol grounding 
and the importance of jargon. The nature of skill-based systems raises interesting 
philosophical issues concerning the nature of rules and their verbal representa- 
tions. 

In the near future we intend to develop more advanced systems. The current 
autopilot is optimized to sail as fast as possible from A to B. A next generation 
would also address tactical and strategic tasks: tactical, to win the race (modelling 
your opponents); strategic, to bring the crew safely to the other side of the ocean. 
Other interesting ambitions are: the construction of better autopilots for 
multihulls, the design of an ultra-safe autonomous cruising yacht, the establishment 
of an official speed record for autonomous sailing yachts, and the deployment of the 
Robosail technology in other areas such as the automotive and aviation industries. 

Two-eyed algorithms are complex prediction algorithms that give accurate 
predictions and also give important insights into the structure of the data the 
algorithm is processing. The main example I discuss is RF/tools, a collection of 
algorithms for classification, regression and multiple dependent outputs. The last 
algorithm is a preliminary version and further progress depends on solving some 
fascinating questions of the characterization of dependency between variables. 

An important and intriguing aspect of the classification version of RF/tools 
is that it can be used to analyze unsupervised data, that is, data without class 
labels. This conversion leads to such by-products as clustering, outlier detection, 
and replacement of missing data for unsupervised data. 

The talk will present numerous results on real data sets. The code (f77) and 
ample documentation for RF/tools are available on the web site 
www.stat.berkeley.edu/RFtools. 



References 

1. Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001. 



N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, p. 9, 2003. 
(c) Springer- Verlag Berlin Heidelberg 2003 



Next Generation Data Mining Tools: 
Power Laws and Self-similarity 
for Graphs, Streams and Traditional Data 



Christos Faloutsos 

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 
christos@cs.cmu.edu 



Abstract. What patterns can we find in bursty web traffic? On the 
web or internet graph itself? How about the distributions of galaxies in 
the sky, or the distribution of a company’s customers in geographical 
space? How long should we expect a nearest-neighbor search to take, 
when there are 100 attributes per patient or customer record? The 
traditional assumptions (uniformity, independence, Poisson arrivals, 
Gaussian distributions) often fail miserably. Should we give up trying to find 
patterns in such settings? 

Self-similarity, fractals and power laws are extremely successful in 
describing real datasets (coastlines, river basins, stock prices, brain 
surfaces, communication-line noise, to name a few). We show some old 
and new successes, involving modeling of graph topologies (internet, web 
and social networks); modeling galaxy and video data; dimensionality 
reduction; and more. 



Introduction — Problem Definition 

The goal of data mining is to find patterns; we typically look for the Gaussian 
patterns that appear often in practice and on which we have all been trained 
so well. However, here we show that these time-honored concepts (Gaussian, 
Poisson, uniformity, independence) often fail to model real distributions well. 
Furthermore, we show how to fill the gap with the lesser-known, but even more 
powerful tools of self-similarity and power laws. 

We focus on the following applications: 

— Given a cloud of points, what patterns can we find in it? 

— Given a time sequence, what patterns can we find? How to characterize and 
anticipate its bursts? 

— Given a graph (e.g., social, or computer network), what does it look like? 
Which is the most important node? Which nodes should we immunize first, 
to guard against biological or computer viruses? 

All three settings appear extremely often, with vital applications. Clouds of 
points appear in traditional relational databases, where records with k attributes 
become points in k-d spaces, e.g. a relation with patient data (age, blood 
pressure, etc.); in geographical information systems (GIS), where points can be, e.g., 
cities on a two-dimensional map; in medical image databases with, for example, 
three-dimensional brain scans, where we want to find patterns in the brain 
activation [ACF+93]; in multimedia databases, where objects can be represented 
as points in feature space [FRM94]. In all these settings, the distribution of 
k-d points is seldom (if ever) uniform [Chr84], [FK94]. Thus, it is important to 
characterize the deviation from uniformity in a succinct way (e.g. as a sum of 
Gaussians, or something even more suitable). Such a description is vital for data 
mining [AIS93], [AS94], for hypothesis testing and rule discovery. A succinct 
description of a k-d point set could help quickly reject some false hypotheses, or 
could help provide hints about hidden rules. 

A second, very popular class of applications is time sequences. Time 
sequences appear extremely often, with a huge literature on linear [BJR94] and 
non-linear forecasting [CE92], and the recent surge of interest in sensor data 
[OJW03] [PBF03] [GGR02]. 

Finally, graphs, networks and their surprising regularities/laws have been 
attracting significant interest recently. The applications are diverse, and the 
discoveries are striking. The World Wide Web is probably the most impressive 
graph, which motivated significant discoveries: the famous Kleinberg algorithm 
[Kle99] and the closely related PageRank algorithm of Google fame [BP98]; the 
fact that it obeys a “bow-tie” structure [BKM+00], while still having a 
surprisingly small diameter [AJB99]. Similar startling discoveries have been made 
in parallel for power laws in the Internet topology [FFF99], for Peer-to-Peer 
(gnutella/Kazaa) overlay graphs [RFI02], and for who-trusts-whom in the 
epinions.com network [RD02]. Finding patterns, laws and regularities in large real 
networks has numerous applications, exactly because graphs are so general and 
ubiquitous: link analysis, for criminology and law enforcement [GSH+03]; 
analysis of virus propagation patterns, on both social/e-mail as well as 
physical-contact networks [WKE00]; networks of regulatory genes; networks of 
interacting proteins [Bar02]; food webs, to help us understand the importance of an 
endangered species. 

We show that the theory of fractals provides powerful tools to solve the above 
problems. 

Definitions 

Intuitively, a set of points is a fractal if it exhibits self-similarity over all scales. 
This is illustrated by an example: Figure 1(a) shows the first few steps in 
constructing the so-called Sierpinski triangle. Figure 1(b) gives 5,000 points that 
belong to this triangle. Theoretically, the Sierpinski triangle is derived from an 
equilateral triangle ABC by excluding its middle (triangle A’B’C’) and by 
recursively repeating this procedure for each of the resulting smaller triangles. The 
resulting set of points exhibits ‘holes’ at every scale; moreover, each smaller 
triangle is a miniature replica of the whole triangle. In general, the characteristic of 
fractals is this self-similarity property: parts of the fractal are similar (exactly 
or statistically) to the whole fractal. For our experiments we use 5,000 sample 
points from the Sierpinski triangle, using Barnsley’s algorithm of Iterated 
Function Systems [BS88] to generate these points quickly. 




(a) construction (b) a finite sample 

Fig. 1. Theoretical fractals: the Sierpinski triangle (a) the first 3 steps of its recursive 
construction (b) a finite sample of it (5K points) 
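
A finite sample like the one in Fig. 1(b) can be produced with the random-iteration form of an iterated function system. The following is only a minimal sketch under our own assumptions, not the code used for the experiments: the function name `sierpinski_points`, the vertex coordinates, the seed and the burn-in length are all illustrative choices. Each step moves the current point halfway towards a randomly chosen vertex, which applies one of the three contraction maps of the triangle.

```python
import random

def sierpinski_points(n, seed=0, burn_in=20):
    """Sample n points near the Sierpinski triangle by random iteration.

    The three maps z -> (z + vertex) / 2 form an iterated function system
    whose attractor is the triangle; the vertices are an arbitrary choice.
    """
    rng = random.Random(seed)
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
    x, y = rng.random(), rng.random()      # arbitrary starting point
    points = []
    for step in range(n + burn_in):
        vx, vy = rng.choice(vertices)      # pick one of the three maps
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        if step >= burn_in:                # drop the first steps, which may
            points.append((x, y))          # still lie far from the attractor
    return points

points = sierpinski_points(5000)
```

The burn-in matters because every map halves the distance to the attractor, so after 20 steps the starting point no longer has a visible influence.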



Notice that the resulting point set is neither a 1-dimensional Euclidean object 
(it has infinite length), nor 2-dimensional (it has zero area). The solution is to 
consider fractional dimensionalities, which are called fractal dimensions. Among 
the many definitions, we describe the correlation fractal dimension, D, because 
it is the easiest to describe and to use. 

Let $nb(\epsilon)$ be the average number of neighbors of an arbitrary point, within 
distance $\epsilon$ or less. For a real, finite cloud of $E$-dimensional points, we follow 
[Sch91] and say that this data set is self-similar in the range of scales $(r_1, r_2)$ if 

$$nb(\epsilon) \propto \epsilon^{D}, \qquad r_1 < \epsilon < r_2 \tag{1}$$

The correlation integral is defined as the plot of $nb(\epsilon)$ versus $\epsilon$ in log-log scales; 
for self-similar datasets, it is linear with slope $D$. 
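
The slope of the correlation integral can be estimated directly from a point set. The sketch below is a hedged illustration: the function name `correlation_dimension`, the choice of radii and the two synthetic test clouds are our own assumptions. It counts pairs within each radius by brute force and fits a least-squares line in log-log scales.

```python
import math
import random

def correlation_dimension(points, radii):
    """Estimate D as the least-squares slope of log nb(eps) versus log eps."""
    n = len(points)
    # All pairwise Euclidean distances: O(n^2), fine for a few hundred points.
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    log_eps, log_nb = [], []
    for eps in radii:
        pairs = sum(1 for d in dists if d <= eps)
        if pairs == 0:
            continue  # log(0) is undefined; skip radii below the data resolution
        # Each close pair contributes one neighbor to both of its endpoints.
        log_eps.append(math.log(eps))
        log_nb.append(math.log(2.0 * pairs / n))
    mean_x = sum(log_eps) / len(log_eps)
    mean_y = sum(log_nb) / len(log_nb)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_eps, log_nb))
    den = sum((x - mean_x) ** 2 for x in log_eps)
    return num / den

rng = random.Random(0)
radii = [0.02 * 1.5 ** i for i in range(6)]   # assumed range of scales
line = [(t, 0.5 * t) for t in (rng.random() for _ in range(400))]
square = [(rng.random(), rng.random()) for _ in range(400)]
d_line = correlation_dimension(line, radii)      # expected to be close to 1
d_square = correlation_dimension(square, radii)  # expected to be close to 2
```

Boundary effects pull both estimates slightly below the integer values, which is one reason the range of scales has to be chosen with care on real data.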

Notice that the above definition of fractal dimension D encompasses the 
traditional Euclidean objects: lines, line segments, circles, and all the standard 
curves have D = 1; planes, disks and standard surfaces have D = 2; Euclidean 
volumes in E-dimensional space have D = E. 

Discussion — How Frequent Are Self-similar Datasets? 

The reader might be wondering whether any real datasets behave like 
fractals, with linear correlation integrals. Numerous real datasets give linear 
correlation integrals, including longitude-latitude coordinates of stars in the 
sky, population-versus-area of the countries of the world [FK94]; several 
geographic datasets [BF95] [FK94]; medical datasets [FG96]; automobile-part shape 
datasets [BBB+97, BBKK97]. 

There is overwhelming evidence from multiple disciplines that fractal datasets 
appear surprisingly often [Man77] (p. 447), [Sch91]: 

— coast lines and country borders ($D \approx 1.1$ to $1.3$); 

— the periphery of clouds and rainfall patches ($D \approx 1.35$) [Sch91] (p. 231); 

— the distribution of galaxies in the universe ($D \approx 1.23$); 

— stock prices and random walks ($D = 1.5$); 

— the brain surface of mammals ($D \approx 2.7$); 

— the human vascular system ($D = 3$, because it has to reach every cell in the 
body!); 

— even traditional Euclidean objects have linear box-counting plots, with 
integer slopes. 

Discussion — Power Laws 

Self-similarity and power laws are closely related. A power law is a law of the 
form 

$$y = f(x) = x^{a} \tag{2}$$

Power laws are the only laws that have no characteristic scales, in the sense that 
they remain power laws, even if we change the scale: $f(c \cdot x) = c^{a} \cdot f(x)$. 

Exactly for this reason, power laws and self-similarity appear often together: 
if a cloud of points is self similar, it has no characteristic scales; any law/pattern 
it obeys, should also have no characteristic scale, and it should thus be a power 
law. 
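
The scale-free property can be checked in one line. The following worked step is a sketch in our own notation, with $a$ denoting the exponent of the power law and $c > 0$ an arbitrary change of scale; it also shows why a power law appears as a straight line in log-log scales.

```latex
\begin{align*}
f(c\,x) &= (c\,x)^{a} = c^{a}\,x^{a} = c^{a}\,f(x), \\
\log y  &= \log x^{a} = a \log x .
\end{align*}
```

Rescaling $x$ therefore only multiplies $f$ by a constant, which shifts the log-log line vertically without changing its slope $a$.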

Power laws also appear extremely often, in diverse settings: in text, with the 
famous Zipf law [Zip49]; in distributions of income (the Pareto law); in scientific 
citation analysis (Lotka law); in distribution of areas of lakes, islands and animal 
habitats (Korcak’s law [Sch91, HS93, PF01]); in earthquake analysis 
(Gutenberg-Richter law [Bak96]); in LAN traffic [LTWW94]; in web click-streams [MF01]; 
and countless more settings. 
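
For categorical data such as words in a text, a power law can be spotted by sorting the items by frequency and plotting frequency against rank in log-log scales. The sketch below is illustrative only; the helper names `rank_frequency` and `loglog_slope` are our own. A slope close to -1 corresponds to the classical Zipf law.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return occurrence counts sorted from most to least frequent."""
    return sorted(Counter(items).values(), reverse=True)

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank), rank from 1."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

counts = rank_frequency("the cat and the dog and the bird".split())
ideal_zipf = [1000.0 / rank for rank in range(1, 51)]  # synthetic frequencies
slope = loglog_slope(ideal_zipf)
```

On real data the fit is usually restricted to a range of ranks, since the head and the tail of the plot often deviate from the straight line.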

Conclusions 

Self-similarity and power laws can solve data mining problems that traditional 
methods cannot. The two major tools that we cover in the talk are: (a) the 
“correlation integral” [Sch91] for a set of points and (b) the “rank-frequency” plot 
[Zip49] for categorical data. The former can estimate the intrinsic dimensionality 
of a cloud of points, and it can help with dimensionality reduction [TTWF00], 
axis scaling [WF02], and separability [TTPF01]. The rank-frequency plot can 
spot power laws, like Zipf’s law, and many more. 

Propensity score methods were proposed by Rosenbaum and Rubin (1983, 
Biometrika) as central tools to help assess the causal effects of interventions. Since 
their introduction two decades ago, they have found wide application in a variety 
of areas, including medical research, economics, epidemiology, and education, 
especially in those situations where randomized experiments are either difficult 
to perform, or raise ethical questions, or would require extensive delays before 
answers could be obtained. Rubin (1997, Annals of Internal Medicine) provides 
an introduction to some of the essential ideas. In the past few years, the number 
of published applications using propensity score methods to evaluate medical and 
epidemiological interventions has increased dramatically. Rubin (2003, Erlbaum) 
provides a summary, which is already out of date. 

Nevertheless, thus far, there have been few applications of propensity score 
methods to evaluate marketing interventions (e.g., advertising, promotions), 
where the tradition is to use generally inappropriate techniques, which focus 
on the prediction of an outcome from an indicator for the intervention and 
background characteristics (such as least-squares regression, data mining, etc.). 
With these techniques, an estimated parameter in the model is used to 
estimate some global “causal” effect. This practice can generate grossly incorrect 
answers that can be self-perpetuating: polishing the Ferraris rather than the 
Jeeps “causes” them to continue to win more races than the Jeeps; visiting 
the high-prescribing doctors rather than the low-prescribing doctors “causes” 
them to continue to write more prescriptions. 

This presentation will take “causality” seriously, not just as a casual 
concept implying some predictive association in a data set, and will show why 
propensity score methods are superior in practice to the standard predictive 
approaches for estimating causal effects. The results of our approach are estimates 
of individual-level causal effects, which can be used as building blocks for more 
complex components, such as response curves. We will also show how the 
standard predictive approaches can have important supplemental roles to play, both 
for refining estimates of individual-level causal effects and for 
assessing how these causal effects might vary as a function of background information, 
both important uses for situations when targeting an audience and/or allocating 
resources are critical objectives. 

The first step in a propensity score analysis is to estimate the individual 
scores, and there are various ways to do this in practice, the most common 
being logistic regression. However, other techniques, such as probit regression 
or discriminant analysis, are also possible, as are the robust methods based on 
the t-family of long-tailed distributions. Other possible methods include highly 
non-linear methods such as CART or neural nets. A critical feature of 
estimating propensity scores is that diagnosing the adequacy of the resulting fit is 
very straightforward, and in fact guides what the next steps in a full 
propensity score analysis should be. This diagnosing takes place without access to the 
outcome variables (e.g., sales, number of prescriptions) so that the objectivity 
of the analysis is maintained. In some cases, the conclusion of the diagnostic 
phase must be that inferring causality from the data set at hand is impossible 
without relying on heroic and implausible assumptions, and this can be very 
valuable information, information that is not directly available from traditional 
approaches. 
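
The two steps described above, estimating the scores and diagnosing the fit without looking at outcomes, can be sketched as follows. This is a toy illustration under our own assumptions, not the software mentioned in this abstract: the data are synthetic with a single background covariate, the logistic regression is fitted by plain gradient ascent, and the diagnostic is the standardized difference in covariate means within five propensity-score strata. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity(xs, ts, lr=1.0, steps=500):
    """Fit P(t = 1 | x) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, t in zip(xs, ts):
            resid = t - sigmoid(a + b * x)
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def mean(values):
    return sum(values) / len(values)

def std_diff(xs, ts, scale):
    """Treated-minus-control difference in the mean of x, in units of `scale`."""
    treated = [x for x, t in zip(xs, ts) if t == 1]
    control = [x for x, t in zip(xs, ts) if t == 0]
    return (mean(treated) - mean(control)) / scale

def balance_check(xs, ts, scores, n_strata=5):
    """Covariate imbalance overall and averaged over propensity-score strata."""
    mx = mean(xs)
    scale = math.sqrt(mean([(x - mx) ** 2 for x in xs]))
    overall = abs(std_diff(xs, ts, scale))
    order = sorted(range(len(xs)), key=lambda i: scores[i])
    size = len(xs) // n_strata
    within = []
    for s in range(n_strata):
        idx = order[s * size:(s + 1) * size]
        within.append(abs(std_diff([xs[i] for i in idx],
                                   [ts[i] for i in idx], scale)))
    return overall, mean(within)

# Synthetic data: treatment is more likely for units with a large covariate.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ts = [1 if rng.random() < sigmoid(-0.5 + x) else 0 for x in xs]
a, b = fit_propensity(xs, ts)
scores = [sigmoid(a + b * x) for x in xs]
overall, within = balance_check(xs, ts, scores)  # no outcome variable is used
```

If the within-stratum imbalance were not clearly smaller than the overall imbalance, the fitted model would need to be revised before any outcome data are examined.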

Marketing applications from the practice of AnaBus, Inc. will also be pre- 
sented. AnaBus currently has a Small Business Innovative Research Grant from 
the US NIH to implement essential software to allow the implementation of the 
full propensity score approach to estimating the effects of interventions. Other 
examples will also be presented if time permits, for instance, an application from 
the current litigation in the US on the effects of cigarette smoking (Rubin, 2002, 
Health Services Outcomes Research). 

An extensive reference list from the author is included. These references are 
divided into five categories. First, general articles on inference for causal effects 
not having a focus on matching or propensity scores. Second, articles that focus 
on matching methods before the formulation of propensity score methods - some 
of these would now be characterized as examples of propensity score matching. 
Third, articles that address propensity score methods explicitly, either theoreti- 
cally or through applications. Fourth, articles that document, by analysis and/or 
by simulation, the superiority of propensity-based methods, especially when 
used in combination with model-based adjustments, over model-based methods 
alone. And fifth, introductions and reviews of propensity score methods. The 
easiest place for a reader to start is with the last collection of articles. 

Such a reference list is obviously very idiosyncratic and is not meant to imply 
that only the author has done good work in this area. Paul Rosenbaum, for 
example, has been an extremely active and creative contributor for many years, 
and his textbook “Observational Studies” is truly excellent. As another example, 
Rajeev Dehejia and Sadek Wahba’s 1999 article in the Journal of the American 
Statistical Association has been very influential, especially in economics. 

Abstract. Classical learning algorithms from the fields of artificial 
neural networks and machine learning typically do not take any costs into 
account or allow only costs depending on the classes of the examples that 
are used for learning. As an extension of class dependent costs, we 
consider costs that are example dependent, i.e. feature and class dependent. We present 
a natural cost-sensitive extension of the support vector machine (SVM) 
and discuss its relation to the Bayes rule. We also derive an approach 
for including example dependent costs into an arbitrary cost-insensitive 
learning algorithm by sampling according to modified probability 
distributions. 



1 Introduction 

The consideration of cost-sensitive learning has received growing attention in 
the past years ([9, 4, 5, 8]). As stated in the Technological Roadmap of the 
MLnet II project (European Network of Excellence in Machine Learning, [10]), 
the inclusion of costs into learning and classification is one of the most relevant 
topics of future machine learning research. 

The aim of the inductive construction of classifiers from training sets is to find 
a hypothesis that minimizes the mean predictive error. If costs are considered, 
each example not correctly classified by the learned hypothesis may contribute 
differently to the error function. One way to incorporate such costs is the use 
of a cost matrix, which specifies the misclassification costs in a class dependent 
manner (e.g. [9,4]). Using a cost matrix implies that the misclassification costs 
are the same for each example of the respective class. 

The idea we discuss in this paper is to let the cost depend on the single 
example and not only on the class of the example. This leads to the notion of 
example dependent costs, which was to our knowledge first formulated in [6]. 
Besides costs for misclassification, we consider costs for correct classification 
(gains are expressed as negative costs). 

One application for example dependent costs is the classification of credit 
applicants to a bank as either being a “good customer” (the person will pay 
back the credit) or a “bad customer” (the person will not pay back parts of the 
credit loan). 

The gain or the loss in a single case forms the (mis-)classification cost for that 
example in a natural way. For a good customer the cost for correct classification is 
the negative gain of the bank, i.e. the cost for correct classification is not the same 
for all customers but depends on the amount of money borrowed. If a good customer 
is rejected because he or she is incorrectly classified as a bad customer, there are 
generally no costs to be expected (or a small loss related to the handling expenses). 
For a bad customer, the cost for misclassification corresponds to the 
actual loss that has occurred. The gain of correct classification is zero (or 
small positive, if one considers handling expenses of the bank). 
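
The credit example can be written down as a small cost function. The sketch below is purely illustrative: the class `Applicant`, the function `decision_cost` and the profit margin `rate` are hypothetical names and values of our own, and handling expenses are ignored.

```python
from dataclasses import dataclass

@dataclass
class Applicant:
    amount: float   # money borrowed
    unpaid: float   # part of the loan never paid back (0.0 for a good customer)

    @property
    def is_good(self):
        return self.unpaid == 0.0

def decision_cost(applicant, accept, rate=0.08):
    """Cost to the bank of one accept/reject decision; gains are negative costs.

    `rate` is a hypothetical profit margin on a fully repaid loan.
    """
    if accept and applicant.is_good:
        return -rate * applicant.amount   # correct classification: the bank's gain
    if accept:
        return applicant.unpaid           # misclassified bad customer: actual loss
    return 0.0                            # rejection: no cost, handling expenses ignored

small_loan = Applicant(amount=1000.0, unpaid=0.0)
large_loan = Applicant(amount=50000.0, unpaid=0.0)
defaulter = Applicant(amount=20000.0, unpaid=12000.0)
```

Two good customers thus carry different costs for the same decision, which a single cost matrix entry per class cannot express.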

As opposed to the construction of a cost matrix, we claim that using the 
example costs directly is more natural and will lead to the production of more 
accurate classifiers. If the real costs are example dependent as in the credit risk 
problem, learning with a cost matrix means that in general only an approxima- 
tion of the real costs is used. When using the classifier based on the cost matrix 
e.g. in the real bank, the real costs as given by the example dependent costs 
will occur and not the costs specified by the cost matrix. Therefore using exam- 
ple dependent costs is better than using a cost matrix for theoretical reasons, 
provided that the learning algorithm used is able to use the example dependent 
costs in an appropriate manner. 

In this paper, we consider the extension of support vector machines (SVMs, 
[11,2,3]) by example dependent costs, and discuss its relationship to the cost- 
sensitive Bayes rule. In addition we provide an approach for including example- 
dependent costs into an arbitrary learning algorithm by using modified example 
distributions. 

This article is structured as follows. In section 2 the Bayes rule in the case 
of example dependent costs is discussed. In section 3, the cost-sensitive SVM for 
non-separable classes is described. In section 4, we discuss the inclusion of costs 
by resampling the dataset. Experiments on some artificial domains can 
be found in section 5. The conclusion is presented in section 6. 

2 Example Dependent Costs 

In the following we consider binary classification problems with classes $-1$ 
(negative class) and $+1$ (positive class). For an example $x \in \mathbb{R}^d$ of class $+1$, let 

— $c_{+1}(x)$ denote the cost of misclassifying $x$ 
— and $g_{+1}(x)$ the cost of classifying $x$ correctly. 

The functions $c_{-1}$ and $g_{-1}$ are equivalently given for examples of class $-1$. In 
our framework, gains are expressed as negative costs, i.e. $g_y(x) < 0$ if there is a 
gain for classifying $x$ correctly into class $y$. $\mathbb{R}$ denotes the set of real numbers, $d$ 
is the dimension of the input vector. 

Let $r : \mathbb{R}^d \to \{+1, -1\}$ be a classifier (decision rule) that assigns $x$ to a class. 

According to [11] the risk of $r$ with respect to the distribution function $P$ of $(x, y)$ 
is given by 

$$R(r) = \int Q(x, y, r)\, dP(x, y). \tag{1}$$

The loss function $Q$ is defined by 

$$Q(x, y, r) = \begin{cases} g_y(x) & \text{if } y = r(x) \\ c_y(x) & \text{else.} \end{cases} \tag{2}$$

We assume that the density $p(x, y)$ exists. Let $X_y = \{x \mid r(x) = y\}$ be the region of 
decision for class $y$. Then the risk can be rewritten with $p(x, y) = p(x|y)P(y)$ as 

$$\begin{aligned}
R(r) = {} & \int_{X_{+1}} g_{+1}(x)\, p(x|{+1})\, P(+1)\, dx + \int_{X_{+1}} c_{-1}(x)\, p(x|{-1})\, P(-1)\, dx \\
& + \int_{X_{-1}} g_{-1}(x)\, p(x|{-1})\, P(-1)\, dx + \int_{X_{-1}} c_{+1}(x)\, p(x|{+1})\, P(+1)\, dx.
\end{aligned} \tag{3}$$

$P(y)$ is the prior probability of class $y$, and $p(x|y)$ is the class conditional 
probability density of class $y$. The first and the third integral express the costs for 
correct classification, whereas the second and the fourth integral express the 
costs for misclassification. We assume that the integrals defining $R$ exist. This 
is the case if the cost functions are integrable and bounded. 

The risk $R(r)$ is minimized if $x$ is assigned to class $+1$ whenever 

$$g_{+1}(x)\, p(x|{+1})\, P(+1) + c_{-1}(x)\, p(x|{-1})\, P(-1) \le g_{-1}(x)\, p(x|{-1})\, P(-1) + c_{+1}(x)\, p(x|{+1})\, P(+1)$$

holds, and to class $-1$ otherwise. From this, the following proposition is derived. 

Proposition 1 (Bayes Classifier). The function 

$$r^*(x) = \operatorname{sign}\big[ (c_{+1}(x) - g_{+1}(x))\, p(x|{+1})\, P(+1) - (c_{-1}(x) - g_{-1}(x))\, p(x|{-1})\, P(-1) \big] \tag{4}$$

minimizes $R$. 

$r^*$ is called the Bayes classifier (see e.g. [1]). As usual, we define $\operatorname{sign}(0) = +1$. 
We assume $c_y(x) - g_y(x) > 0$ for every example $x$, i.e. there is a real benefit for 
classifying $x$ correctly. 

From (4) it follows that the classification of examples depends on the 
difference of the costs for misclassification and correct classification, not on their 
actual values. Therefore we will assume $g_y(x) = 0$ and $c_y(x) > 0$ without loss of 
generality. 
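
The effect of example dependent costs on the Bayes classifier can be illustrated numerically. The following sketch makes several assumptions of our own: the classes are one-dimensional Gaussians with means $+1$ and $-1$, the priors are equal, and the cost functions are invented for the example. The function `bayes_classifier` evaluates the sign of the cost-weighted difference between the two class terms.

```python
import math

def gauss_pdf(x, mean, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def bayes_classifier(x, c_pos, c_neg, g_pos, g_neg, p_pos, p_neg, prior_pos):
    """Cost-sensitive Bayes rule: returns +1 or -1 for the example x."""
    score = ((c_pos(x) - g_pos(x)) * p_pos(x) * prior_pos
             - (c_neg(x) - g_neg(x)) * p_neg(x) * (1.0 - prior_pos))
    return 1 if score >= 0 else -1   # sign(0) = +1 by convention

# Assumed class-conditional densities and costs (illustrative only).
p_pos = lambda x: gauss_pdf(x, +1.0)
p_neg = lambda x: gauss_pdf(x, -1.0)
zero = lambda x: 0.0                              # g_y(x) = 0
unit = lambda x: 1.0                              # equal costs for all examples
costly_left = lambda x: 4.0 if x < 0 else 1.0     # positives with x < 0 are expensive to miss

def classify(x, c_pos):
    return bayes_classifier(x, c_pos, unit, zero, zero, p_pos, p_neg, 0.5)
```

With equal costs the decision boundary lies at $x = 0$; with the assumed cost function `costly_left` it moves to about $x = -0.69$, so `classify(-0.5, unit)` yields $-1$ while `classify(-0.5, costly_left)` yields $+1$.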

Given a training sample $(x^{(1)}, y^{(1)}, c^{(1)}), \ldots, (x^{(l)}, y^{(l)}, c^{(l)})$ with $c^{(i)} = c_{y^{(i)}}(x^{(i)})$, 
the empirical risk is defined by 

$$R_{\mathrm{emp}}(r) = \frac{1}{l} \sum_{i=1}^{l} Q(x^{(i)}, y^{(i)}, r).$$

It holds $Q(x^{(i)}, y^{(i)}, r) = c^{(i)}$ if the example is misclassified and $Q(x^{(i)}, y^{(i)}, r) = 0$ 
otherwise. In our case, $R_{\mathrm{emp}}$ corresponds to the mean misclassification costs 
defined using the example dependent costs. 

Proposition 2 ([11]). If both cost functions are bounded by a constant $B$, then 
it holds with a probability of at least $1 - \eta$ 

$$R(r) \le R_{\mathrm{emp}}(r) + B \sqrt{\frac{h \left( \ln \frac{2l}{h} + 1 \right) - \ln \frac{\eta}{4}}{l}},$$

where $h$ is the VC-dimension of the hypothesis space of $r$. 

Vapnik’s result from [11] (p. 80) holds in our case, since the only assumptions he 
made on the loss function are its non-negativity and boundedness. 
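
The empirical risk with example dependent costs is simply the mean cost of the misclassified examples. The sketch below uses a made-up sample of $(x, y, c)$ triples and threshold classifiers on a single feature; the names `empirical_risk` and `threshold_rule` are our own.

```python
def empirical_risk(classifier, sample):
    """Mean misclassification cost over (x, y, c) triples, assuming g_y(x) = 0."""
    return sum(c for x, y, c in sample if classifier(x) != y) / len(sample)

def threshold_rule(theta):
    """Classifier that predicts +1 for x >= theta and -1 otherwise."""
    return lambda x: 1 if x >= theta else -1

# Illustrative sample: the negative example at x = 0.2 is expensive to misclassify.
sample = [(-1.0, -1, 1.0), (0.2, -1, 5.0), (0.4, 1, 1.0), (1.5, 1, 2.0)]

risk_low = empirical_risk(threshold_rule(0.0), sample)    # misclassifies x = 0.2
risk_high = empirical_risk(threshold_rule(0.5), sample)   # misclassifies x = 0.4
```

Both rules make exactly one error on this sample, yet their empirical risks differ by a factor of five, which is the distinction that an error rate or a class dependent cost matrix would miss.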

Let $\bar{c}_{+1}$ and $\bar{c}_{-1}$ be the mean misclassification costs for the given distributions. 
Let $r^+$ be the Bayes optimal decision rule with respect to these class dependent 
costs. Then it is easy to see that $R(r^*) \le R(r^+)$, where $R(r^*)$ (see above) and 
$R(r^+)$ are evaluated with respect to the example dependent costs. I.e. because 
the example dependent costs can be considered to be the real costs occurring, 
their usage can lead to decreased misclassification costs. Of course this is only 
possible if the learning algorithm is able to incorporate example dependent costs. 

In the following, we will discuss the cost-sensitive construction of an r using 
the SVM approach. In the presentation we assume that the reader is familiar 
with SVM learning. 



3 Support Vector Machines 



If the class distributions have no overlap there is a decision rule $r^*$ with zero error. 
It holds $R(r^*) = 0$, independent of the cost model used. Since the cost model 
does not influence the optimal hypothesis, we will not consider hard margin 
SVMs in this paper. For soft margin SVMs the learning problem can be stated 
as follows. 

Let $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, l\} \subset \mathbb{R}^d \times \{+1, -1\}$ be a training sample and 
$c_{y^{(i)}}(x^{(i)}) = c^{(i)}$ the misclassification costs defined above. For learning from a 
finite sample, only the sampled values of the cost functions need to be known, 
not their definition. We divide $S$ into subsets $S_{\pm 1}$ which contain the indices of 
all positive and negative examples respectively. By means of $\phi : \mathbb{R}^d \to \mathcal{H}$ we 
map the input data into a feature space $\mathcal{H}$ and denote the corresponding kernel 
by $K$. The optimization problem can now be formulated as 

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i \in S_{+1}} c_{+1}(x^{(i)})\, \xi_i^k + C \sum_{i \in S_{-1}} c_{-1}(x^{(i)})\, \xi_i^k \tag{5}$$

$$\text{s.t.} \quad y^{(i)} \left( w \cdot \phi(x^{(i)}) + b \right) \ge 1 - \xi_i, \tag{6}$$

$$\xi_i \ge 0, \tag{7}$$

where the regularization constant $C > 0$ determines the trade-off between the 
weighted empirical risk and the complexity term. 

$w$ is the weight vector that together with the threshold $b$ defines the 
classification function $f(x) = \operatorname{sign}(h(x) + b)$ with $h(x) = w \cdot \phi(x)$. The slack variable 
$\xi_i$ is zero for objects that have a functional margin of more than 1. For objects 
with a margin of less than 1, $\xi_i$ expresses how much the object fails to have the 
required margin, and is weighted with the cost value of the respective example. 
$\xi$ is the margin slack vector containing all $\xi_i$. $\|w\|$ (the norm in $\mathcal{H}$) can be 
interpreted as the norm of $h$. 

With $k = 1, 2$ we obtain the soft margin algorithms including individual 
costs (1-norm SVM and 2-norm SVM). Both cases can be extended to example 
dependent costs. 
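
To make the role of the individual costs concrete, the following sketch evaluates and minimizes the cost-weighted soft margin objective for a linear kernel and $k = 1$. It is not the procedure of this paper: plain subgradient descent on the primal is substituted here for the quadratic program, and the function names, step size, iteration count and toy data are all assumptions made for illustration.

```python
def decision_value(w, b, x):
    """h(x) + b for a linear kernel, i.e. w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def objective(w, b, sample, C=1.0, k=1):
    """Cost-weighted soft margin objective with each slack at its smallest feasible value."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slack = sum(c * max(0.0, 1.0 - y * decision_value(w, b, x)) ** k
                for x, y, c in sample)
    return reg + C * slack

def train_linear(sample, C=1.0, lr=0.01, steps=2000):
    """Subgradient descent on the k = 1 objective (a stand-in for the dual QP)."""
    dim = len(sample[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = list(w), 0.0
        for x, y, c in sample:
            if y * decision_value(w, b, x) < 1.0:   # margin violated: slack is active
                for j in range(dim):
                    grad_w[j] -= C * c * y * x[j]
                grad_b -= C * c * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def toy_sample(c_pos, c_neg):
    """Two clear examples plus two conflicting examples at x = 0 with given costs."""
    return [((2.0,), 1, 1.0), ((-2.0,), -1, 1.0),
            ((0.0,), 1, c_pos), ((0.0,), -1, c_neg)]

w_a, b_a = train_linear(toy_sample(5.0, 1.0))   # positive example at 0 is costly
w_b, b_b = train_linear(toy_sample(1.0, 5.0))   # negative example at 0 is costly
```

On this toy sample the two conflicting examples at $x = 0$ cannot both be classified correctly, and the learned threshold sides with whichever of them carries the higher cost.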



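1-Norm SVM. Introducing non-negative Lagrange multipliers $\alpha_i, \mu_i \ge 0$, 
$i = 1, \ldots, l$, we can rewrite the optimization problem with $k = 1$ and resolve the 
following primal Lagrangian 

$$\begin{aligned}
L_P(w, b, \xi, \alpha, \mu) = {} & \frac{1}{2} \|w\|^2 + C \sum_{i \in S_{+1}} c_{+1}(x^{(i)})\, \xi_i + C \sum_{i \in S_{-1}} c_{-1}(x^{(i)})\, \xi_i \\
& - \sum_{i=1}^{l} \alpha_i \left[ y^{(i)} \left( w \cdot \phi(x^{(i)}) + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{l} \mu_i \xi_i .
\end{aligned}$$

Taking the derivative with respect to $w$, $b$ and $\xi$ leads to 

$$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{l} \alpha_i y^{(i)} \phi(x^{(i)}) = 0 \tag{8}$$

$$\frac{\partial L_P}{\partial b} = - \sum_{i=1}^{l} \alpha_i y^{(i)} = 0 \tag{9}$$

$$\frac{\partial L_P}{\partial \xi_i} = C c_{+1}(x^{(i)}) - \alpha_i - \mu_i = 0, \quad \forall i \in S_{+1} \tag{10}$$

$$\frac{\partial L_P}{\partial \xi_i} = C c_{-1}(x^{(i)}) - \alpha_i - \mu_i = 0, \quad \forall i \in S_{-1} \tag{11}$$

Substituting (8)-(11) into the primal, we obtain the dual Lagrangian that has to 
be maximized with respect to the $\alpha_i$, 

$$L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)}). \tag{12}$$

Equation (12) is called the 1-norm soft margin SVM. Note that the values of the 
cost function $c_y$ do not occur in $L_D$. 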

The Karush-Kuhn-Tucker conditions hold, and the corresponding comple- 
mentary conditions are 

e.(Cc+i(x«)-a,) =0, ytGS+i 

^,(Cc_i(x«)-a,) =0,Vx€ .S_i. 



(13) 

(14) 
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Thus the ai are bounded within the so called box constraints 

0 < a, < Cc+i(x(*)), Vi e S+x 
0 < < Cc_i(x(*)), Vi G S-i. 

I.e. in the case of example dependent costs, the box constraints depend 
cost value for the respective example. 

2-Norm SVM. The optimization problem in (7) leads with k = 2 
minimization of the primal Lagrangian 

Lp{w,b,^,a) = ^||w||^ 

Analogous to the 1-norm case, the minimization of the primal is equivalent to 
maximizing the dual Lagrangian given by 

i ^ i 

Ld{oi) = “ 2 ^ aiU^y^yj K{yS'^\yS^^) 

i—1 i,j—l 

_ 1 

In contrast to the 1-norm SVM, L o depends on the values of the costs functions 
Cy. The quadratic optimization problem can be solved with slightly modified 
standard techniques, e.g. [3]. 

3.1 Convergence to the Bayes Rule

In [7] the cost free SVM learning problem is treated as a regularization problem
in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$

$$\min \; \frac{1}{l} \sum_{i=1}^{l} \xi_i^k + \lambda \|h\|^2_{\mathcal{H}_K} \qquad (17)$$

with $f(x) = h(x) + b$ subject to (6), (7). Lin showed in [7] that the solution to
(17) approximates the Bayes rule for large training sets, if $\lambda$ is chosen in
an optimal manner, and the kernel is rich enough (e.g. spline kernels).
Analogous to Lin we can rewrite the optimization problem in (5) to get

$$\min \; \frac{1}{l} \sum_{i=1}^{l} c_{y^{(i)}}(x^{(i)})\, \xi_i^k + \lambda \|h\|^2_{\mathcal{H}_K} \qquad (18)$$

subject to (6), (7), where (6), (7) can be rewritten to

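$$\xi_i \geq 1 - y^{(i)} f(x^{(i)}) \qquad (19)$$

$$\xi_i \geq 0. \qquad (20)$$

We define the function $(z)_+ = 0$, if $z < 0$, and $(z)_+ = z$, else. Then (19) and
(20) can be integrated into the single inequality

$$(1 - y^{(i)} f(x^{(i)}))_+ \leq \xi_i. \qquad (21)$$

With this inequality, the minimization problem can be rewritten to

$$\min \; \frac{1}{l} \sum_{i=1}^{l} c_{y^{(i)}}(x^{(i)}) \left( (1 - y^{(i)} f(x^{(i)}))_+ \right)^k + \lambda \|h\|^2_{\mathcal{H}_K}. \qquad (22)$$

For $l \to \infty$, the data driven term converges to

$$E_{X,Y}\left[ c_Y(X) \left( (1 - Y f(X))_+ \right)^k \right] \qquad (23)$$

with random variables $Y$ and $X$. Equation (23) is equivalent to

$$E_X\left[ E_Y\left[ c_Y(X) \left( (1 - Y f(X))_+ \right)^k \,\middle|\, X \right] \right]. \qquad (24)$$

(24) can be minimized by minimizing $E_Y[\cdot]$ for every fixed $X = x$, giving the
expression to be minimized

$$c_{-1}(x) \left( (1 + f(x))_+ \right)^k (1 - p(x)) + c_{+1}(x) \left( (1 - f(x))_+ \right)^k p(x), \qquad (25)$$

where $p(x) := p(+1|x)$.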

According to the proof in [7] it can be shown that the function $f$ that
minimizes (25) minimizes the modified expression

$$g = c_{-1}(x)(1 + f(x))^k (1 - p(x)) + c_{+1}(x)(1 - f(x))^k p(x). \qquad (26)$$

By setting $z := f(x)$ and solving $\partial g / \partial z = 0$, we derive the decision function

$$f^*(x) = \frac{[c_{+1}(x)p(x)]^{\frac{1}{k-1}} - [c_{-1}(x)(1 - p(x))]^{\frac{1}{k-1}}}{[c_{+1}(x)p(x)]^{\frac{1}{k-1}} + [c_{-1}(x)(1 - p(x))]^{\frac{1}{k-1}}}.$$

A random pattern is assigned to class +1 if $f^*(x) > 0$ and to class −1 otherwise.
The above proves the following proposition.

Proposition 3. In the case $k = 2$, $\mathrm{sign}(f^*(x))$ is a minimizer of $R$, and it
minimizes (23). It holds

$$\mathrm{sign}(f^*(x)) = r^*(x).$$

$\mathrm{sign}(f^*(x))$ can be shown to be equivalent to (4) in the case $k = 2$ by using the
definition of the conditional density and by simple algebraic transformations.
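The decision function can be checked numerically against the expression it is supposed to minimize. The sketch below is only an illustration: the function names `f_star` and `g` and the chosen values for $p(x)$ and the costs are assumptions, and it presumes strictly positive costs, $0 < p(x) < 1$ and $k > 1$.

```python
def f_star(p, c_pos, c_neg, k=2):
    """Stationary point of g(z) for posterior p = p(+1|x) and costs c_pos, c_neg."""
    e = 1.0 / (k - 1)
    a = (c_pos * p) ** e
    b = (c_neg * (1.0 - p)) ** e
    return (a - b) / (a + b)


def g(z, p, c_pos, c_neg, k=2):
    """Expression (26) evaluated at z = f(x)."""
    return c_neg * (1 + z) ** k * (1 - p) + c_pos * (1 - z) ** k * p


# Hypothetical point: the positive class is less likely (p = 0.3),
# but misclassifying a positive example is three times as expensive.
p, c_pos, c_neg = 0.3, 3.0, 1.0
z_opt = f_star(p, c_pos, c_neg)
decision = +1 if z_opt > 0 else -1
```

In this example the cost-free rule would choose class −1 because $p(x) < 0.5$, whereas the cost-sensitive decision function is positive since $c_{+1}(x)p(x) = 0.9$ exceeds $c_{-1}(x)(1 - p(x)) = 0.7$.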

It can be conjectured from Proposition 3 that SVM learning approximates
the Bayes rule for large training sets. For $k = 1$ the corresponding result cannot be
shown.

4 Re-sampling



Example dependent costs can be included into a cost-insensitive learning
algorithm by re-sampling the given training set. First we define the mean costs for
each class by

$$B_y = \int c_y(x)\, p(x|y)\, dx. \qquad (27)$$

We define the global mean cost $b = B_{+1}P(+1) + B_{-1}P(-1)$. From the
cost-sensitive definition of the risk in (3) it follows that

$$\frac{R(r)}{b} = \int_{\{x:\, r(x) = +1\}} \frac{c_{-1}(x)\, p(x|-1)}{B_{-1}} \cdot \frac{B_{-1}P(-1)}{b}\, dx + \int_{\{x:\, r(x) = -1\}} \frac{c_{+1}(x)\, p(x|+1)}{B_{+1}} \cdot \frac{B_{+1}P(+1)}{b}\, dx.$$

I.e. we now consider the new class conditional densities

$$p'(x|y) = \frac{1}{B_y}\, c_y(x)\, p(x|y)$$

and new priors

$$P'(y) = P(y)\, \frac{B_y}{B_{+1}P(+1) + B_{-1}P(-1)}.$$

It is easy to see that $\int p'(x|y)\, dx = 1$ holds, as well as $P'(+1) + P'(-1) = 1$.

Because $b$ is a constant, minimizing the cost-sensitive risk $R(r)$ is equivalent
to minimizing the cost-free risk

$$\frac{R(r)}{b} = R'(r) = \int_{\{x:\, r(x) = +1\}} p'(x|-1) P'(-1)\, dx + \int_{\{x:\, r(x) = -1\}} p'(x|+1) P'(+1)\, dx.$$

The following proposition holds.

Proposition 4. A decision rule $r$ minimizes $R'$ if it minimizes $R$.

The proposition follows from $R(r) = b R'(r)$.

In order to minimize $R'$, we have to draw a new training sample from
the given training sample. Assume that a training sample $(x^{(1)}, y^{(1)}), \ldots,
(x^{(l)}, y^{(l)})$ of size $l$ is given. Let $C_y$ be the total cost for class $y$ in the
sample. Based on the given sample, we form a second sample of size $lN$ by random
sampling from the given training set, where $N > 0$ is a fixed real number.

It holds for the compound density

$$p'(x, y) = p'(x|y) P'(y) = \frac{c_y(x)}{b}\, p(x, y). \qquad (28)$$

Therefore, in each of the independent sampling steps, the probability of
including example $i$ in this step into the new sample should be determined by

$$\frac{c_{y^{(i)}}(x^{(i)})}{C_{+1} + C_{-1}},$$

i.e. an example is chosen according to its contribution to the total cost of the
fixed training set. Note that $(C_{+1} + C_{-1})/l \approx b$ holds. Because of $R(r) = bR'(r)$, it
holds $R_{emp}(r) \approx b \cdot R'_{emp}(r)$, where $R_{emp}(r)$ is evaluated with respect to the
given sample, and $R'_{emp}(r)$ is evaluated with respect to the generated cost-free
sample. I.e. a learning algorithm that tries to minimize the expected cost-free
risk by minimizing the mean cost-free risk will minimize the expected cost for
the original problem. From the new training set, a classifier for the cost-sensitive
problem can be learned with a cost-insensitive learning algorithm.
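The sampling scheme described above can be sketched with the standard library. The function names, the rounding of the sample size with `floor`, and the toy data are assumptions for illustration; the paper does not prescribe an implementation.

```python
import math
import random


def selection_probabilities(costs):
    """Probability of drawing example i in one step: c_i / (C_{+1} + C_{-1})."""
    total = sum(costs)
    return [c / total for c in costs]


def resample_by_cost(examples, costs, n_factor, rng=None):
    """Draw floor(n_factor * l) examples with replacement, proportionally to cost."""
    rng = rng or random.Random()
    size = math.floor(n_factor * len(examples))
    # random.choices normalizes the weights itself, so raw costs can be passed.
    return rng.choices(examples, weights=costs, k=size)


# Hypothetical training set of three labelled examples with their costs.
examples = [("x1", +1), ("x2", -1), ("x3", +1)]
costs = [0.0, 1.0, 3.0]
probs = selection_probabilities(costs)
new_sample = resample_by_cost(examples, costs, n_factor=2.5, rng=random.Random(0))
```

Because the draw is with replacement, expensive examples typically occur several times in the new sample, which is exactly the situation the following remarks on cross validation warn about.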

Re-sampling from a fixed sample is only sensible if the original sample is
large enough. In particular, multiple inclusion of the same example into the new
training set can cause problems, e.g. when estimating the accuracy using cross
validation, where the example may occur in one of the training sets and in the
respective test set. We assume that the re-sampling method is inferior to using
the example dependent costs directly. Thorough experiments on this point have
to be conducted in the future.



5 Experiments 



We have shown in section 2 that the usage of example dependent costs will in
general lead to decreased costs for classifier application. In section 3 we showed
that the inclusion of example dependent costs into the SVM is possible and
sound. To demonstrate the effects of the example dependent costs and the
convergence to the Bayes classifier, we have conducted experiments on two artificial
domains. The two classes of the first data set were defined by Gaussian
distributions having means $\mu_{\pm 1} = (0, \pm 1)^T$ and equal covariance matrices
$\Sigma_{\pm 1} = \mathbf{1}$ respectively. The cost functions $c_{\pm 1}$ are defined as follows

$$c_y(x) = \frac{2}{1 + \exp(-y x_1)}, \quad y \in \{+1, -1\}, \qquad (29)$$

see figure 1.a. We used radial basis function kernels for learning. The result of
learning is also displayed in fig. 1.b-d for different numbers of training examples
($l = 128, 256, 512$).
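The artificial domain can be reproduced along the following lines. This is a sketch, not the authors' code: the function names are hypothetical, and equal class priors are an assumption that the text does not state explicitly.

```python
import math
import random


def cost(x, y):
    """Example dependent cost c_y(x) = 2 / (1 + exp(-y * x_1)) from (29)."""
    return 2.0 / (1.0 + math.exp(-y * x[0]))


def draw_sample(l, rng):
    """Draw l labelled points from the two unit-covariance Gaussians."""
    data = []
    for _ in range(l):
        y = rng.choice((+1, -1))  # assumption: equal priors
        # Mean (0, y) and identity covariance: independent unit-variance coordinates.
        x = (rng.gauss(0.0, 1.0), rng.gauss(float(y), 1.0))
        data.append((x, y))
    return data


rng = random.Random(42)
sample = draw_sample(128, rng)
sample_costs = [cost(x, y) for x, y in sample]
```

The cost of a point depends only on its first coordinate, while the class means differ only in the second one, so the costs carry information that the class densities alone do not.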

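For the given distributions, and the given cost functions, the expected risk
is given by

$$R = \int_{\{x:\, r(x) = -1\}} \frac{\exp\left(-\frac{1}{2}\left(x_1^2 + (x_2 - 1)^2\right)\right)}{2\pi (1 + e^{-x_1})}\, dx + \int_{\{x:\, r(x) = +1\}} \frac{\exp\left(-\frac{1}{2}\left(x_1^2 + (x_2 + 1)^2\right)\right)}{2\pi (1 + e^{x_1})}\, dx.$$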

The decision boundary is determined by the equality of the two integrands. After
simple transformations it can be seen that the class boundary is defined by the
hyperplane $x_1 + 2x_2 = 0$ and the optimal Bayes classifier decides in favour of
class +1 if $x_1 + 2x_2 > 0$ and −1 otherwise. Figure 1 shows the approximation
of the Bayes classifier for data sets containing 128, 256 and 512 examples with
individual costs given in (29). The optimal parameter settings were determined
by cross validation. We do not present the parameter settings, because they are
not interesting for the purpose of this article.

Fig. 1. Cost functions and approximation of the Bayes optimal classifier (solid
and dashed lines) with $l = 128, 256, 512$. The projections of the points on
the dotted lines lie on the margin hyperplanes. Panels: a) cost functions,
b) $l = 128$, c) $l = 256$, d) $l = 512$.
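The "simple transformations" that lead to the class boundary $x_1 + 2x_2 = 0$ can be spelled out as follows. This is a worked sketch under the assumption of equal class priors, so that the priors and the common factor $\exp(-x_1^2/2)/(2\pi)$ cancel on both sides.

```latex
% Equality of the two cost-weighted class densities:
c_{+1}(x)\, p(x|+1) = c_{-1}(x)\, p(x|-1)
\;\Longleftrightarrow\;
\frac{\exp\left(-\tfrac{1}{2}(x_2 - 1)^2\right)}{1 + e^{-x_1}}
  = \frac{\exp\left(-\tfrac{1}{2}(x_2 + 1)^2\right)}{1 + e^{x_1}}

% Collect the cost terms on the left and the Gaussian terms on the right:
\frac{1 + e^{x_1}}{1 + e^{-x_1}}
  = \exp\left(\tfrac{1}{2}(x_2 - 1)^2 - \tfrac{1}{2}(x_2 + 1)^2\right)
  = e^{-2 x_2}

% The left-hand side simplifies to e^{x_1}, hence
e^{x_1} = e^{-2 x_2}
\;\Longleftrightarrow\;
x_1 + 2 x_2 = 0
```

Setting both cost functions to a constant removes the left-hand factor and leaves $x_2 = 0$, the cost-free Bayes boundary.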

The Bayes classifier without costs is defined by the line $x_2 = 0$. Using class
dependent instead of example dependent costs results in lines $x_2 = \frac{1}{2}\ln(c_{-1}/c_{+1})$,
where $c_{\pm 1}$ denote the costs for positive and negative examples respectively. In
contrast to example dependent costs, a rotation of the line is not possible for
class dependent costs.

The decision based on class dependent costs is suboptimal for points between
the lines $x_1 + 2x_2 = 0$ and $x_2 = \frac{1}{2}\ln(c_{-1}/c_{+1})$. For the cost functions in (29), the
theoretical mean costs are given by $c_{+1} = c_{-1} = 1.0$. I.e. the decision based on
class dependent costs is suboptimal with respect to the example dependent costs
for points between the lines $x_1 + 2x_2 = 0$ and $x_2 = 0$.

An example for using class dependent costs computed as mean costs is shown
in fig. 2.a. Here the individual costs in (29) were averaged for both classes and the
resulting means interpreted as class dependent costs $c_{+1} = 0.989$ and $c_{-1} = 0.984$
respectively. The learned classifier therefore coincides approximately with the
cost-free Bayes classifier, see fig. 2.a. I.e. the information about the costs is lost
by using class dependent costs.

Fig. 2. a) Using class dependent, i.e. mean costs (left figure), b) Result for separable
dataset with the example costs in (29) (right figure).

An example of a separable data set with example dependent costs given in
(29) is shown in fig. 2.b. As expected the resulting classifier is not influenced by
using the cost functions (29). Note that due to prop. 1 the Bayes classifier $r^*$ in
(4) is defined by the class boundary $x_2 = -0.5$. Since we defined sign(0) = +1 and
decide in favour of class +1 if $r^* > 0$, all points within the tube $-0.5 < x_2 < 0.5$
are assigned to class +1 by $r^*$. Allowing an arbitrary choice of the class, if the
argument of sign in (4) equals zero, yields a whole set of Bayes decision rules.
From this set, the SVM has constructed one with maximum margin.

6 Conclusion 

In this article, we discussed a natural cost-sensitive extension of SVMs by ex- 
ample dependent classification and misclassification costs. The cost-insensitive 
SVM can be obtained as a special case of the SVM with example dependent 
costs. 

We showed that the Bayes rule only depends on differences between costs
for correct classification and for misclassification. This allows us to define a 
simplified learning problem where the costs for correct classification are assumed 
to be zero. For the simplified problem, we stated a bound for the cost-sensitive 
risk. A bound for the original problem with costs for correct classification can 
be obtained in a similar manner. 

We have stated the optimization problems for the soft margin support vector 
machine with example dependent costs and derived the dual Lagrangians. For 
the case $k = 2$, we discussed the approximation of the Bayes rule using SVM
learning. However, a formal proof of convergence is still missing.

We suspect that the inclusion of example dependent costs may be sensible in 
the hard margin case too, i.e. for separable classes (fig. 2). It may lead to more 
robust classifiers and will perhaps allow the derivation of better error bounds.

Independently from the SVM framework, we have discussed the inclusion of
example dependent costs into a cost-insensitive learning algorithm by resampling
the original examples in the training set according to their costs. This way
example dependent costs can be incorporated into an arbitrary cost-insensitive 
learning algorithm. 

The usage of example dependent costs instead of class dependent costs will
lead to a decreased misclassification cost in practical applications, e.g. credit risk
assignment.

Abstract. This paper presents Abalearn, a self-teaching Abalone program
capable of automatically reaching an intermediate level of play
without needing expert-labeled training examples, deep searches or
exposure to competent play.

Our approach is based on a reinforcement learning algorithm that is risk- 
seeking, since defensive players in Abalone tend to never end a game. 
We show that it is the risk-sensitivity that allows a successful self-play 
training. We also propose a set of features that seem relevant for achiev- 
ing a good level of play. 

We evaluate our approach using a fixed heuristic opponent as a bench- 
mark, pitting our agents against human players online and comparing 
samples of our agents at different times of training. 



1 Introduction 

This paper presents Abalearn, a self-teaching Abalone program directly inspired 
by Tesauro’s famous TD-Gammon [14], which used Reinforcement Learning (RL) 
methods to learn by self-play a Backgammon evaluation function. We chose 
Abalone because the game’s dynamics represent a difficult challenge for RL 
methods, particularly for methods of self-play training. It has been shown [8] 
that Backgammon’s dynamics are crucial to the success of TD-Gammon, because 
of its stochastic nature and the smoothness of its evaluation function. Abalone, 
on the other hand, is a deterministic game that has a very weak reinforcement 
signal: in fact, players can easily repeat the same kind of moves and the game 
may never end if one doesn’t take chances. 

Exploration is vital for RL to work well. Previous attempts to build an agent 
capable of learning how to play games through reinforcement either use expert- 
labeled training examples [5] or exposure to competent play (online play against 
humans [3] or learning by playing against a heuristic player [5]). We propose a 
method capable of efficient self-play learning for the game Abalone that is based 
on risk-sensitive RL [7]. We also provide a set of features and state
representations for learning to play Abalone, using only the outcome of the game as a
training signal.

Table 1. Complexity of several games.

Game         Branch   States     Source
Chess        30-40    10^50      [4]
Checkers     8-10     10^?       [10]
Backgammon   ±420     10^20      [18]
Othello      ±5       < 10^30    [20]
Go 19x19     ±360     10^160     [11]
Abalone      ±80      < 3^61     [1]


The rest of the paper is organized as follows: section 2 briefly analyses the 
game’s complexity. Section 3 refers and explains the most significant previous 
RL efforts in games. Section 4 details the training method behind Abalearn and 
section 5 describes the state representations used. Finally, section 6 presents 
the results obtained using a heuristic player as benchmark, as well as results of 
games against other programs and human expert players. Section 7 draws some 
conclusions about our work. 



2 Complexity in the Game Abalone 

The rules of Abalone are simple to understand: to win, one has to push off the
board 6 out of the 14 opponent’s stones by outnumbering him/her¹. Despite
this apparent simplicity, the game is very popular and challenging [1]. Table 1 
compares the branching factor and the state space dimension of some zero-sum 
games. The data was gathered from a selection of papers that analyzed those 
games. 

These are all estimated values, since it is very difficult to determine rigorously 
the true values of these variables. Abalone has a branching factor higher than 
Chess, Checkers and Othello, but does not match the complexity of Go. The 
branching factor in backgammon is due to the dice rolls and is the main reason 
why other search techniques have to be used for this game. 

The problem in Abalone is that when the two players are defensive enough, 
the game can easily go on forever, making the training more difficult (since it 
weakens the reinforcement signal). 



3 Related Work 

In this section we present a small survey on programs that learn to play games us- 
ing RL. The most used method is Temporal Difference Learning, or TD-Learning. 
Samuel’s checkers player [9] already used a form of temporal difference learn- 
ing, as well as Michie’s Tic-tac-toe player [6]. They both pre-date reinforcement 
learning as a field, but both basically use the same ideas. 

¹ For further information about the game’s rules and strategies, please refer to the
Official Abalone Web-site: www.abalonegames.com.

3.1 The Success of TD-Gammon

Tesauro’s TD-Gammon [16] caused a small revolution in the field of RL.
TD-Gammon was a Backgammon player that needed very little domain knowledge,
but still was able to reach master-level play [15]. The learning algorithm, a
combination of TD(λ) with a non-linear function approximator based on a neural
network, became quite popular.

Besides predicting the expected return of the board position, the neural
network also selected both the agent’s and the opponent’s moves throughout the game. The
move selected was the one for which the function approximator gave the highest
value.

Modeling the value function with a neural network poses a number of difficul- 
ties, including what the best network topology is and what the input encoding 
should look like. Tesauro used a number of backgammon-specific features in ad- 
dition to the other information representing the board to increase the informa- 
tion immediately available to the neural network. He found that this additional 
information gave another performance improvement. 

TD-Gammon’s surprising results were never repeated for other complex board
games, such as Go, Chess and Othello. Many authors [11,2,8] have discussed
Backgammon’s characteristics that make it perfectly suitable for TD-learning
through self-play. Among others, they emphasize: the speed of the game
(TD-Gammon was trained by playing 1.5 million games), the smoothness of the
game’s evaluation function, which facilitates the approximation via neural
networks, and the stochastic nature of the game: the dice rolls force exploration,
which is vital in RL.

Pollack and Blair show that a method initially considered weak (training a
neural network using a simple hill-climbing algorithm) leads to a level of play
close to the TD-Gammon level [8], which suggests that there is a bias in the
dynamics of Backgammon that inclines it in favor of TD-learning techniques.
Although Tesauro does not entirely agree with Pollack and Blair [17], it is quite
surprising that such a simple procedure works at all.



3.2 Exposure to Competent Play 

Learning from self-play is difficult as the network must bootstrap itself out of 
ignorance without the benefit of exposure to skilled opponents. As a consequence, 
a number of reported successes are not based on the networks’ own predictions, 
but instead they learn by playing against commercial programs, heuristic players, 
human opponents or even by simply observing recorded games between human 
players. This approach helps to focus on the fraction of the state space that is really
relevant for good play, but it requires an expert player, which is what
we want to obtain in the first place.

The Chess program KnightCap was trained by playing against human
opponents on an internet chess server [3]. As its rating improved, it attracted stronger
and more diverse opponents, since humans tend to choose partners of the same level
of play. This was crucial to KnightCap’s success, since the opponents guided
KnightCap throughout its training (similar to the dice rolls in backgammon,
which facilitated exploration of the state space). Thrun’s program, NeuroChess
[19], was trained by playing against GNUChess, a heuristic player, using TD(0).

Dahl [5] proposes a hybrid approach for Go: a neural network is trained to
imitate local game shapes made by an expert database via supervised learning.
A second net is trained to estimate the safety of groups of stones using TD(λ),
and a third net is trained, also by TD(λ)-learning, to estimate the potential of
non-occupied points of the board.



4 Abalearn’s Training Methodology 

Temporal difference learning (TD-learning) is an unsupervised RL algorithm
[12]. In TD-learning, the evaluation of a given position is adjusted by using the
differences between its evaluation and the evaluations of successive positions.

Sutton defined a whole class of TD algorithms which look at predictions of
positions which are further ahead in the game and weight them exponentially
less according to their temporal distance by the parameter λ.

Given a series of predictions, $V_0, \ldots, V_T, V_{T+1}$, the weights in the
evaluation function can be modified according to:

$$\Delta w_t = \alpha (V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w V_k \qquad (1)$$

TD(0) is the case in which only the one state preceding the current one is
changed by the TD error (λ = 0). For larger values of λ, but still λ < 1, more
of the preceding states are changed, but each more temporally distant state is
changed less. We say that earlier states are given less credit for the TD error
[13].

Thus, the λ parameter determines whether the algorithm is applying short
range or long range prediction. The α parameter determines how quickly this
learning takes place.
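The update in (1) can be sketched for the simplest case of a linear evaluation function, where the gradient of the prediction is just the feature vector. This is an illustrative assumption: Abalearn uses a neural network, and the function names and toy features below are hypothetical. The sum over past gradients is maintained incrementally as an eligibility trace.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda_episode(features, final_value, w, alpha, lam):
    """Apply update (1) along one game for a linear evaluator V(s) = w . phi(s).

    features    -- feature vectors phi(s_1), ..., phi(s_T) of the visited positions
    final_value -- outcome of the game, used as the last prediction V_{T+1}
    """
    w = list(w)
    trace = [0.0] * len(w)  # holds sum_k lam^(t-k) * grad_w V_k
    for t, phi in enumerate(features):
        v_t = dot(w, phi)
        if t + 1 < len(features):
            v_next = dot(w, features[t + 1])
        else:
            v_next = final_value
        # Decay the old gradients by lam and add the current one.
        trace = [lam * e + p for e, p in zip(trace, phi)]
        delta = v_next - v_t
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


# Hypothetical two-position game that ends with outcome 1.
w_new = td_lambda_episode([[1.0, 0.0], [0.0, 1.0]], 1.0, [0.0, 0.0], alpha=0.5, lam=0.5)
```

In the toy run the final TD error changes the weight of the last position by 0.5 and that of the earlier position by only 0.25, which is the exponentially decaying credit described above.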

A standard feed-forward two-layer neural network represents the agent’s
evaluation function over the state space and is trained by combining TD(λ) with
the Backpropagation procedure. We used the standard sigmoid as the activation
function for the hidden and output layers’ units. Weights are initialized to small
random values between −0.01 and 0.01.

Rewards of +1 are given whenever the agent pushes an opponent’s stone off
the board or whenever it wins the game. When the agent loses the game or when
the opponent pushes an agent’s stone the reward is −1, otherwise it is 0².

² Another option would be to give a positive reward only at the end of the game
(when six stones have been pushed off the board). The agent would be able to learn
to “sacrifice” stones in order to improve its position. This option has not been used
in the present paper, in part because we believe that a value function that takes
into account sacrifices must be much more difficult to approximate. This may be an
interesting direction for future work.

One of the problems we encountered was that self-play was not effective
because the agent repeatedly kept playing the same kind of moves, never ending
a game. When training is based on self-play, the problem of exploration is very
important because the agent may restrict itself to a small portion of the state
space and become weaker and weaker, because the opponent is itself. This
characteristic is not specific to Abalone but applies to any agent that learns by
self-play.

One way to favor exploration of the state space is to use an ε-greedy policy.
During training, the agent follows an ε-greedy policy, selecting a random action
with probability ε and selecting the action judged by the current evaluation
function as having the highest value with probability 1 − ε. The drawback of this
solution is that it introduces noise in the policy “blindly”, i.e. without taking
into account the value of the current state.
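An ε-greedy move selection can be sketched as follows. The function name, the move labels and the value table are hypothetical; in Abalearn the values would come from the neural network evaluation of the positions reached by each move.

```python
import random


def epsilon_greedy(actions, value_of, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=value_of)


# Hypothetical move values for one position.
values = {"push": 0.8, "retreat": 0.1, "sidestep": 0.3}
moves = list(values)
rng = random.Random(1)
greedy_move = epsilon_greedy(moves, values.get, epsilon=0.0, rng=rng)
random_move = epsilon_greedy(moves, values.get, epsilon=1.0, rng=rng)
```

The random branch ignores `value_of` entirely, which is the "blind" noise the paragraph above refers to.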

The solution was to provide the agent with a sensitivity to risk during
learning. Mihatsch and Neuneier [7] recently proposed a method that can help
accomplish this. Their risk-sensitive RL algorithm transforms the temporal differences.
In this approach, $\kappa \in (-1, 1)$ is a scalar parameter which specifies the desired
risk-sensitivity. The function

$$\chi^{\kappa} : x \mapsto \begin{cases} (1 - \kappa)\,x & \text{if } x > 0, \\ (1 + \kappa)\,x & \text{otherwise} \end{cases} \qquad (2)$$

is called the transformation function, since it is used to transform the temporal
differences according to the risk sensitivity. The risk sensitive TD algorithm
updates the estimated value function $V$ according to

$$V_t(s_t) = V_{t-1}(s_t) + \alpha \chi^{\kappa}\left[ R(s_t, a_t) + \gamma V_{t-1}(s_{t+1}) - V_{t-1}(s_t) \right] \qquad (3)$$

When $\kappa = 0$ we are in the risk-neutral case. If we choose $\kappa$ to be positive
then we overweight negative temporal differences

$$R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t) < 0 \qquad (4)$$

with respect to positive ones. That is, we overweight transitions to states where
the immediate return $R(s, a)$ happened to be smaller than in the average. On the
other hand, we underweight transitions to states that promise a higher return
than in the average. In other words, the agent is risk-avoiding when $\kappa > 0$ and
risk-seeking when $\kappa < 0$. We discovered that negative values for $\kappa$ lead to an
efficient self-play learning (see section 6).

When a neural network function approximator is used with Risk-Sensitive
Reinforcement Learning, the TD(λ) update rule for parameters becomes:

$$\Delta w_t = \alpha\, \chi^{\kappa}(d_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w V(s_k; w) \qquad (5)$$

with

$$d_t = R(s_t, a_t) + \gamma V(s_t; w) - V(s_{t-1}; w)$$
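The transformation function and the tabular update (3) are short enough to sketch directly. The function names and the numeric values are illustrative assumptions; only the two formulas come from the text.

```python
def chi(x, kappa):
    """Transformation function (2): scale positive and negative TD errors differently."""
    return (1.0 - kappa) * x if x > 0 else (1.0 + kappa) * x


def risk_sensitive_update(v_s, reward, v_next, alpha, gamma, kappa):
    """One application of update (3) to the value estimate of the current state."""
    td_error = reward + gamma * v_next - v_s
    return v_s + alpha * chi(td_error, kappa)


# A risk-seeking agent (kappa < 0) amplifies good surprises and damps bad ones.
gain = chi(1.0, -0.8)
loss = chi(-1.0, -0.8)
updated = risk_sensitive_update(0.0, 1.0, 0.0, alpha=0.1, gamma=0.9, kappa=-0.8)
```

With $\kappa = -0.8$ a positive temporal difference is weighted nine times as strongly as a negative one of the same size, so an agent trained by self-play is pulled towards moves that occasionally pay off rather than towards safe repetition.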
Fig. 1. The architecture used for Abalearn 2 encodes the number of stones in the
center, in the middle, in the border and pushed off the board (left), and the same for
the opponent’s stones. Abalearn 3 adds some basic features of the game (right).
Legend: border, middle, center.

5 Efficient State Representation 

The state representation is crucial to a learning system, since it defines every- 
thing the agent might ever learn. In this section, we describe the neural network 
architectures we implemented and studied. 

Let us first consider a typical architecture that is trained to evaluate board
positions using a direct representation of the board. We call the agent using this
architecture Abalearn 1. It is a basic and straightforward state representation,
since it merely describes the contents of the board: it maps each position in the
board to −1 if the position contains an opponent’s stone, +1 if it contains an
agent’s stone and 0 if it is empty. It also encodes the number of stones pushed
off the board (for both players).

We wish the network to achieve a good level of play. Clearly, this task can 
be better accomplished by exploiting some characteristics of the game that are 
relevant for good play. We used a simple architecture that encodes the number 
of stones in the center, in the middle, in the border and pushed off the board (see 
Figure 1); and the same for the opponent’s stones. The state is thus represented 
by a vector of 8 features, plus a bias input unit set to 1. We called this agent 
Abalearn 2. This network is quite a simple feature map, but it is capable of 
learning to play Abalone, as we will see in the next section. 
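The 8-feature encoding of Abalearn 2 can be sketched as below. Several details are assumptions made only for illustration, since the text does not specify them: the board is addressed in axial hexagonal coordinates with the central cell at (0, 0), and the three zones are taken to be distance 0-1 (center), 2-3 (middle) and 4 (border) from that cell. The function names are hypothetical.

```python
def hex_distance(cell):
    """Distance of an axial-coordinate cell (q, r) from the central cell (0, 0)."""
    q, r = cell
    return max(abs(q), abs(r), abs(q + r))


def zone(cell):
    """Assumed zone boundaries; the paper's exact definition is not given."""
    d = hex_distance(cell)
    if d <= 1:
        return "center"
    if d <= 3:
        return "middle"
    return "border"


def zone_counts(stones):
    counts = {"center": 0, "middle": 0, "border": 0}
    for cell in stones:
        counts[zone(cell)] += 1
    return [counts["center"], counts["middle"], counts["border"]]


def abalearn2_features(own, opponent, own_pushed_off, opponent_pushed_off):
    """Four counts per player plus a bias input fixed to 1."""
    return (zone_counts(own) + [own_pushed_off]
            + zone_counts(opponent) + [opponent_pushed_off] + [1.0])


# Hypothetical position with three own stones and one opponent stone.
vec = abalearn2_features({(0, 0), (2, 0), (4, -4)}, {(0, 4)}, 1, 0)
```

Compared with the direct board encoding of Abalearn 1, this representation discards the exact stone positions and keeps only how close each side is to the center and how much material it has lost.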

We then incorporated into a new architecture (Abalearn 3) some extra hand- 
crafted features, illustrated in Figure 1. Abalearn 3 adds some relevant (although 
basic) features of the game to the previous architecture. We added: protection 
(number of stones totally surrounded by stones of the same color), the average 
distance of the stones to the center of the board and the number of stones 
threatened (see Figure 1). 

6 Results 

In this section we present the results of two training methods. Common
parameter values in both methods are: α = 0.1, γ = 0.9. Unless specified, the value of
the λ parameter was 0.7.

Fig. 2. Comparison between some reference networks, sampled after 10, 250 and 2750
training games (average of 500 games) shows that learning is succeeding.



Method I. This method applies standard TD(λ) using Abalearn 2, described in
the previous section. In method I, the agent plays 1000 games against a random
opponent in order to extract some basic knowledge (mainly learning to push
the opponent’s stones off the board). After that phase, we train the agent using
self-play. This method never succeeds when using self-play training from the
beginning.

Method I(a). This method is the same as Method I. Only the state
representation changes to Abalearn 3. This method is necessary to prove the benefit of
the added features in Abalearn 3 with respect to Abalearn 2.

Method II. We wished to obtain an agent capable of efficient and automatic
self-play learning. Method II accomplishes this. It applies the risk-sensitive
version of TD(λ) using self-play and Abalearn 3, also described in the previous
section. Exploration is important especially at the beginning of the training, so we
used a decreasing ε: after each game $t$, $\epsilon_{t+1} = 0.99 \times \epsilon_t$, with $\epsilon_0 = 0.9$.
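The decay schedule for ε is a geometric sequence and can be written down directly. The function name is hypothetical; the two constants are the ones stated above.

```python
def epsilon_schedule(n_games, eps0=0.9, decay=0.99):
    """Exploration rate used in game t = 0, 1, ..., n_games - 1."""
    schedule = []
    eps = eps0
    for _ in range(n_games):
        schedule.append(eps)
        eps *= decay  # epsilon_{t+1} = 0.99 * epsilon_t
    return schedule


schedule = epsilon_schedule(501)
```

Under this schedule the exploration rate falls from 0.9 to roughly a third of that value after 100 games and is below 0.01 after 500 games, so random moves matter mostly in the early phase of training.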

Testing Methods. The most straightforward method for testing our agents
is by averaging their win rate against a good heuristic³ player. The heuristic
function sums the distance to the center of the board of each stone (subtracting
it if it’s an opponent stone). We also tested our agents by playing some games
against the best Abalone program and by making them play at the Abalone
Website against human experts.

6.1 Method I: Standard TD(λ)

We tested our networks against three networks sampled during previous
training. Figure 2 shows the results. Each curve represents an average over 500 games.
Each network on the x-axis plays against Net 10, Net 250 and Net 2750
(networks sampled after 10, 250 and 2750 training games respectively). As we can
see, it is easy for the networks to beat Net 10. On the other hand, Net 2750 is
far superior to all the others.

³ We use a simple Minimax search algorithm.

Fig. 3. Performance of the agents when trained against different kinds of opponents
(x-axis: training games).

Table 2. Comparison between the two methods (Win Rate against Heuristic Player).

Training Games   Method I   Method I(a)
500              48%        68%
1000             52%        72%
2000             54%        76%
3000             71%        79%

Exposure to Competent Play. A good playing partner offers knowledge to 
the learning agent, because it easily leads the agent through the relevant fractions 
of the state space. 

In this experiment, we compare agents that are trained by playing against a 
random opponent, a strong minimax player and by self-play. Figure 3 summarizes 
the results. Each point corresponds to an average over 500 games against the 
heuristic opponent. We can see that a skilled opponent is more useful than a 
random opponent, as expected. 

The Benefit of the Features. Table 2 compares the two state representations:
it presents the win percentage against a heuristic player over 100 testing games,
using method I and I(a). The agent trained with method I(a) uses the state
representation with added features (see section 5) and, after only 1000 games of
training, performs better than the agent trained with method I.
This indicates that the added features were relevant to learning the game and yielded
better performance.

6.2 Method II: Self-play with Risk-Seeking TD(λ)

Figure 4 shows the results of training for four different risk sensitivities: κ = −1
(the most risk-seeking agent), κ = −0.8, κ = −0.3 and κ = 0 (the classical
risk-neutral case). We trained and tested 10 agents. We can see that performance is
best when κ = −0.8 and κ = −1. We verified that after 10000 games of self-play
training with κ = −1 performance remained the same (see Figure 5, which plots the
results for the first 2000 games). By assuming that losses are inevitable, the agent
ignores most of the negative temporal differences and the weights associated to
the material advantage are positively rewarded.

Fig. 4. Performance of the risk-sensitive RL agents when trained by self-play for various
values of risk-sensitivity. Self-play is efficient for negative values of risk-sensitivity.

Fig. 5. Improvement in performance of the risk-seeking self-playing agent (κ = −1).

We trained the agent with κ = 0 and it didn’t learn to push the opponent’s
stones, thereby losing most games against the heuristic player, except for 1 out
of 10 runs of the experiment. This is because the lack of risk-sensitivity leads
to highly conservative policies where the agent learns to maintain its stones in
the center of the board and avoids pushing the opponent’s stones. This experiment
illustrates the importance of risk-sensitivity in self-play learning: in method I(a),
performance is worse (see Table 2).

Performance against the best program. We wanted to evaluate how TD-
learning fares competitively against other methods. ABA-PRO, a commercial

Table 3. Abalearn using method I with fixed 1-ply search depth only loses when the
opponent's search depth is 6-ply. Method II performs better.

Method I Depth=1 vs.:    Stones Won   Stones Lost   Moves   First Move
ABA-PRO Depth=4          0            0             31      ABA-PRO
ABA-PRO Depth=5          0            0             23      ABA-PRO
ABA-PRO Depth=6          0            2             61      ABA-PRO

Method II Depth=1 vs.:   Stones Won   Stones Lost   Moves   First Move
ABA-PRO Depth=4          0            0             29      ABA-PRO
ABA-PRO Depth=5          0            0             21      ABA-PRO
ABA-PRO Depth=6          0            0             42      ABA-PRO



Table 4. Abalearn playing online managed to beat intermediate players.

Abalearn Method I vs.:            Stones Won   Stones Lost   First Move
ELO 1448 (weak intermediate)      6            1             Human Player
ELO 1590 (strong intermediate)    3            6             Human Player
ELO 1778 (expert)                 0            6             Human Player

Abalearn Method II vs.:           Stones Won   Stones Lost   First Move
ELO 1501 (intermediate)           2            0             Human Player
ELO 1500 (intermediate)           6            1             Human Player
ELO 1590 (strong intermediate)    6            1             Human Player
ELO 1590 (strong intermediate)    6            3             Human Player
ELO 1590 (strong intermediate)    6            4             Human Player
ELO 1590 (strong intermediate)    6            4             Human Player



application that is one of the best Abalone computer players built so far [1],
relies on sophisticated search methods and hand-tuned heuristics that are hard
to discover. It also uses deep, highly selective searches (ranging from 2 to 9-ply).
Therefore, we pitted Abalearn, trained as described before, against ABA-PRO.

Table 3 shows some results obtained by varying the search depth of ABA-PRO
while our agent performed a fast 1-ply search. The free version is
limited to 6-ply search.

As we can see, Abalearn only loses 2 stones when its opponent's search depth
is 6. This shows that it is possible to achieve a good level of play using our
training methodology. Once again, method II performs better (it never loses).

Performance against Human Experts. To better assess Abalearn's level of
play, we made it play online at the Abalone Official Server. As in all other games,
players are ranked by their ELO.

Table 4 shows the results of some games played by Abalearn online against
players of different ELOs. Method I beat a player with ELO 1448 by 6 to 1
and lost by 3 to 6 against an experienced 1590 ELO player. When
playing against a former Abalone champion, Abalearn using method I lost by
6 to 0, but it took more than two hours for the champion to beat Abalearn,
mainly because Abalearn defends very well and one has to try to ungroup its
stones slowly towards a victory.

Footnote (ABA-PRO experiment): When the game reaches a stage where both players
repeat the same moves 20 consecutive times, we end the game (tie by repetition).
We carried out this experiment manually because we did not implement an interface
between the two programs.

Method II is more promising because of its incorporated extra features. We
have tested it against players of ELO 1501, 1500 and 1590 (see Table 4).

7 Conclusions 

This paper describes a program, Abalearn, that learns how to play the game of
Abalone using the TD(λ) algorithm and a neural network to model the value
function. The relevant information given to the learning agent is limited to the
reinforcement signal and a set of features that define the agent's state. The
program learns by playing against itself.

We showed that the use of a risk-sensitive version of the TD(λ) algorithm
allows the agent to learn by self-play. The performance level of Abalearn was
evaluated against a heuristic player, a commercial application and human players.
In all cases Abalearn shows promising performance. The best agent wins about
90% of the games against the heuristic player and ties against strong opponents.
Our agent uses only a single-step lookahead. One possible direction for further
work is to integrate search with RL, as Baxter et al. have shown [2].

Abstract. In this paper, an adaptive news event detection method is proposed.
We consider a news event as a life form and propose an aging theory to model
its life span. A news event becomes popular with a burst of news reports, and it
fades away with time. We incorporate the proposed aging theory into the
traditional single-pass clustering algorithm to model life spans of news events.
Experimental results show that the proposed method performs fairly well
for both long-running and short-term events compared to other approaches.



1 Introduction 

Nowadays, the Web has become a huge information treasure. Via the simple HyperText
Markup Language (HTML) [1], people can publish and share valuable knowledge
conveniently and easily. However, as the number of Web documents increases,
obtaining desired information from the Web becomes time-consuming and sometimes
requires specific knowledge to make the best use of search engines and the returned
results. On-line news reflects this information explosion problem. It is difficult
to access and assimilate desired information from the hundreds of news documents
generated per day by different agencies. Techniques such as classification [7][9]
and personalization [5][6] were invented to facilitate news reading. However,
classification is not totally effective, in that readers generally follow news by
interesting threads, not categories. Moreover, unexpected events, such as accidents,
awards and sports championships, fall outside the learned user profile. Therefore,
to reduce search time and the number of search results, a precise event detection
method that discovers news events automatically is necessary.

Event detection is part of Topic Detection and Tracking (TDT) [2], in which a news
event is defined as a set of incidents that occur at some place and time and are
associated with some specific actions. In contrast with a category in traditional
text classification, events are localized in space and time. The job of event
detection is to find new events in several news streams. Besides discussing the TDT
techniques for on-line news, in this paper we also discuss one interesting issue
about news events: the event life cycle. Usually, new news events appear in a news
burst and gradually die out as time goes on [8]. Ignoring temporal relations of news
events will degrade the performance of a TDT system. Previous works [3][14] were
aware of the importance of the temporal information of news events to TDT. Their
experimental results showed that modeling temporal information of news events could
discriminate between similar but distinct events efficiently. In this paper, we
propose the concept of aging theory to model life cycles of news events. Experiments
show that our approach can remedy the deficiencies of other methods.

The rest of the paper is organized as follows. In Section 2, we give a review of re- 
lated works. In Section 3, we propose the concept of aging theory. Section 4 describes 
the algorithms that apply the aging theory to a news reading system. We evaluate the 
system performance in Section 5. Finally, conclusions and future work are given in 
Section 6. 

2 Related Works 

The Topic Detection and Tracking (TDT) project [2] is a DARPA-sponsored activity
to detect and track news events from streams of broadcast news stories. It consists
of three major tasks: segmentation, detection and tracking. Our focus, the
retrospective detection task [3][14], is oriented toward unsupervised learning [11].
Without being given any labeled training examples, the job of retrospective
detection is to identify events from a news corpus. The traditional hierarchical
agglomerative clustering (HAC) algorithm [13] is suitable for retrospective
detection. However, the computation cost of HAC, which is quadratic in the number
of input documents when using group average clustering [14], makes it infeasible
when the number of news documents per day is high. Yang et al. [14] used the
technique of bucketing and re-clustering to speed up HAC. However, there is a chance
that information from a long-running event would be spread over too many buckets,
which would divide the event into several events [14].

Another popular approach to retrospective detection is single-pass clustering (or
incremental clustering) [4]. The single-pass clustering method processes the input
documents iteratively and chronologically. A news document is merged with the most
similar detected cluster if the similarity between them is above a pre-defined
threshold; otherwise, the document is treated as the seed of a new cluster. However,
considering only the similarity between clusters and documents will lead
context-similar, but event-different, stories to be merged together. In order to
obtain better clusters, temporal relations between news documents (or clusters) must
be incorporated into the clustering algorithm. Allan et al. [3] proposed a
time-based threshold approach to model the temporal relation. By gradually raising
the detection threshold, distant documents become difficult to align with existing
clusters. Therefore, different events can be discerned. Yang et al. [14] modeled the
temporal relation with a time window and a decaying function. The size of a time
window specifies the number of prior documents (or events) to be considered when
clustering. The decaying function weights the influence of a document in the window
based on the gap between it and the examined document. Similar to the time-based
threshold approach, distant documents in the time window have less impact on
clustering than those nearby.

Even though the above methods enhance the result of the single-pass clustering
algorithm, they do not adapt to all types of event detection. The increasing
threshold of the time-based threshold method keeps distant stories of long-running
events from being tracked, while the large window size of the time window method may
mix up many expired, context-similar, short-term events. In order to balance this
tradeoff and tackle both long-running and short-term events, a self-adaptive event
life cycle management mechanism is necessary. We present an aging theory for the
event life cycle in Section 3. For more information about TDT, [4] gives a detailed
survey of existing systems and approaches in recent years.



3 Aging Theory 

A news event is considered a life form with stages of birth, growth, decay and death.
To track life cycles of events, we use the concept of an energy function. Like the
endogenous fitness of an artificial life agent [10], the value of the energy function
indicates the liveliness of a news event in its life span. The energy of an event
increases when the event becomes popular, and it diminishes with time. Therefore, a
function of the number of news documents can be used to model the growing stage of
events. On the other hand, to model the diminishing or aging stages, a decay factor
is required.

3.1 Notations and Definitions 

News documents are to an event what food is to a life form. As various foods do not
contribute the same nutrition to a life form, different news documents make
different contributions to an event's liveliness (i.e. popularity). The degree of
similarity between a news document and an event is used to represent the nutrition
contribution. The accumulated similarity between news documents and event V in a
time slot t is denoted by $x_t$. The time slot t can be any time interval. In the
implementation, we use one day as a time slot.

We then define $\alpha$ as the nutrition transferred factor and $\beta$ as the
nutrition decayed factor, $0 < \alpha < 1$, $0 < \beta < 1$, and
$y_t = g(x_1, \ldots, x_t, \alpha, \beta)$. $\alpha$ decides the increase of
nutrition from an input news document and $\beta$ decides the nutrition loss in a
period. The net nutrition $y_t$ is a compound variable consisting of the nutrition
of each time slot $x_1, \ldots, x_t$, $\alpha$ and $\beta$. Different g()s mean
different efficiencies of nutrition for different events. A function F(y) that has
the following properties is called the energy function of V:

$0 < F(y) < 1$    (1)

$F(y)$ is a strictly increasing function of $y$, with $F(\infty) = 1$ and $F(0) = 0$    (2)

The problem of event life cycle management is to find the optimal combination of
$\alpha$ and $\beta$ such that the energy value is 1 when all news documents of the
event V appear. However, by Equation (2), the energy value would never be 1.
Therefore, we loosen the equation and redefine the optimal condition as

$F(r \cdot y_T) = s$    (3)

where
r is a proportion of $\sum_{i=1}^{T} (\alpha x_i)$;
s is a constant;
T is the number of time slots.

Both r and s are selected by the users.



3.2 Growth Only 

One extreme case of the event life cycle is no decay, which means the energy of the
event will be accumulated with the clustering of related news documents. In this
case, $y_t$ is simply the count of related news documents. Formally, we let
$y_t = \sum_{i=1}^{t} (\alpha x_i)$. We want Equation (3) to hold, so

$F(r \sum_{i=1}^{T} \alpha x_i) = s$    (4)

Since F is a strictly increasing function of y, we can take the inverse function
$F^{-1}$ of both sides of Equation (4):

$r \alpha \sum_{i=1}^{T} x_i = F^{-1}(s)$

We then divide both sides by $r \sum_{i=1}^{T} x_i$ to solve for $\alpha$:

$\alpha^{*} = F^{-1}(s) / (r \sum_{i=1}^{T} x_i)$    (5)
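As a minimal sketch of Equation (5), the function below computes the growth-only transfer factor from the per-slot similarities. The names `growth_only_alpha` and `sigmoid_inverse` are hypothetical, and the inverse energy function assumes F(y) = 10y / (1 + 10y), which is our reading of the sigmoid adopted in Section 4.2.

```python
from typing import Callable, Sequence


def sigmoid_inverse(s: float) -> float:
    """Inverse of the ASSUMED energy function F(y) = 10y / (1 + 10y)."""
    if not 0.0 <= s < 1.0:
        raise ValueError("s must lie in [0, 1)")
    return s / (10.0 * (1.0 - s))


def growth_only_alpha(
    x: Sequence[float],
    r: float,
    s: float,
    f_inverse: Callable[[float], float] = sigmoid_inverse,
) -> float:
    """Solve Equation (5): alpha* = F^{-1}(s) / (r * sum of x_i)."""
    total = sum(x)
    if total <= 0 or r <= 0:
        raise ValueError("need a positive similarity total and a positive r")
    return f_inverse(s) / (r * total)


if __name__ == "__main__":
    # Hypothetical daily accumulated similarities for one training event.
    daily_similarity = [1.0, 2.0, 1.0]
    print(growth_only_alpha(daily_similarity, r=1.0, s=0.5))
```

Substituting the returned value back into Equation (4) is a quick way to check that the chosen r and s are met.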



3.3 Constant Decay 

The extreme case described above is not very likely in real-world systems because
the energy of an event should not only grow but also eventually diminish with age.
Hence, in this section we present the constant decay method, which subtracts a
constant value for each time slot.

Formally, the domain of the energy function of constant decay is defined as follows:

$y_t = g(x_1, \ldots, x_t, \alpha, \beta) = \alpha \sum_{i=1}^{t} x_i - \beta t$    (6)

There are two parameters in Equation (6); hence we need two equations to solve for
them. Let $r_1$, $r_2$ denote two proportions of $\sum_{i=1}^{T} x_i$. Then

$y_{t_1} = \alpha r_1 \sum_{i=1}^{T} x_i - \beta t_1$    (7)

and

$y_{t_2} = \alpha r_2 \sum_{i=1}^{T} x_i - \beta t_2$    (8)

As in the previous section, we substitute Equations (7) and (8) into the optimal
condition (3), take the inverse function $F^{-1}$ of both sides and get

$\alpha r_1 \sum_{i=1}^{T} x_i - \beta t_1 = F^{-1}(s_1)$    (9)

and

$\alpha r_2 \sum_{i=1}^{T} x_i - \beta t_2 = F^{-1}(s_2)$    (10)

Solve for $\alpha$ and $\beta$ by (9) and (10):

$\alpha^{*} = [t_2 F^{-1}(s_1) - t_1 F^{-1}(s_2)] / [(r_1 t_2 - r_2 t_1) \sum_{i=1}^{T} x_i]$    (11)

and

$\beta^{*} = \{ r_1 [t_2 F^{-1}(s_1) - t_1 F^{-1}(s_2)] / (r_1 t_2 - r_2 t_1) - F^{-1}(s_1) \} / t_1$    (12)
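The closed forms (11) and (12) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the inverse energy function assumes F(y) = 10y / (1 + 10y) as read from Section 4.2, and the sample inputs are made up. It solves for both parameters and then substitutes them back into Equations (9) and (10).

```python
from typing import Callable, Tuple


def sigmoid_inverse(s: float) -> float:
    """Inverse of the ASSUMED energy function F(y) = 10y / (1 + 10y)."""
    if not 0.0 <= s < 1.0:
        raise ValueError("s must lie in [0, 1)")
    return s / (10.0 * (1.0 - s))


def solve_constant_decay(
    total: float,
    r1: float, s1: float, t1: float,
    r2: float, s2: float, t2: float,
    f_inverse: Callable[[float], float] = sigmoid_inverse,
) -> Tuple[float, float]:
    """Return (alpha*, beta*) from Equations (11) and (12).

    `total` is the sum of x_i over all T time slots of the event.
    """
    a1, a2 = f_inverse(s1), f_inverse(s2)
    denom = r1 * t2 - r2 * t1
    if denom == 0 or total <= 0 or t1 <= 0:
        raise ValueError("degenerate inputs: the two conditions are not independent")
    numer = t2 * a1 - t1 * a2
    alpha = numer / (denom * total)          # Equation (11)
    beta = (r1 * numer / denom - a1) / t1    # Equation (12)
    return alpha, beta


if __name__ == "__main__":
    # Hypothetical event: total similarity 10, 20% reached on day 2, 85% on day 6.
    alpha, beta = solve_constant_decay(10.0, 0.20, 0.20, 2, 0.85, 0.85, 6)
    # Residuals of Equations (9) and (10); both should be close to zero.
    print(alpha * 0.20 * 10.0 - beta * 2 - sigmoid_inverse(0.20))
    print(alpha * 0.85 * 10.0 - beta * 6 - sigmoid_inverse(0.85))
```

The guard on `denom` reflects that the two conditions must be linearly independent; otherwise the pair (α, β) is not uniquely determined.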



4 Representations and Algorithms 

4.1 News Document Representation 

Both news documents and events are represented as vectors in the conventional
vector space model (VSM) [13], but each has a different scheme to determine term
weights. For news documents, we use the traditional TF-IDF [13] scheme for term
weighting, which is defined as

$w_{t,d} = tf_{t,d} \cdot \log(N / df_t)$    (13)

where
$w_{t,d}$ is the weight of term t in document d;
$tf_{t,d}$ is the within-document term frequency (TF);
$\log(N / df_t)$ is the inverse document frequency (IDF);
N is the number of documents in the system corpus;
$df_t$ is the number of documents in the corpus in which t occurs.

The term weights of a news event are obtained from a set of detected documents.
However, due to the temporal relation of news documents, the event's weights must
be updated progressively to reflect event evolution. We adopt the classic Rocchio
method [12] to update the term weights of events incrementally:

(14)

where
$w_{t,e}$ is the weight of term t in the detected event e;
$w_{t,d}$ is the weight of term t in the inserted document d;
$\gamma$ is a parameter between 0 and 1.

Simply put, the term weight of an event is a weighted combination of its original
term weight and the term weights of newly detected news documents. Besides the term
vector, we also assign each event a real number eng, called the energy value, to
indicate its vitality. The energy of an event increases when the event is popular,
and it decreases by a constant value for each period of time. Therefore, events that
receive little interest will gradually fade out.
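The two weighting schemes can be sketched as follows. The TF-IDF function follows Equation (13); the natural logarithm is an assumption, since the base is not stated. The event update is only one plausible reading of the Rocchio-style rule: a convex combination with γ on the old event weight is assumed here, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def tfidf_vector(tokens: List[str], doc_freq: Dict[str, int], n_docs: int) -> Vector:
    """Equation (13): w_{t,d} = tf_{t,d} * log(N / df_t), natural log assumed."""
    term_freq = Counter(tokens)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
        if doc_freq.get(term, 0) > 0  # skip terms unseen in the corpus statistics
    }


def update_event_vector(event_vec: Vector, doc_vec: Vector, gamma: float) -> Vector:
    """ASSUMED update: gamma * old event weight + (1 - gamma) * document weight."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    terms = set(event_vec) | set(doc_vec)
    return {
        t: gamma * event_vec.get(t, 0.0) + (1.0 - gamma) * doc_vec.get(t, 0.0)
        for t in terms
    }


if __name__ == "__main__":
    doc_freq = {"election": 10, "vote": 100}  # hypothetical corpus statistics
    doc = tfidf_vector(["election", "election", "vote"], doc_freq, n_docs=100)
    print(update_event_vector({"election": 1.0}, doc, gamma=0.5))
```

A larger γ makes the event vector change more slowly as new documents arrive, while a smaller γ lets recent documents dominate.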



4.2 Event Detection Algorithm 

Based on the aging theory described in Section 3, the energy-based event detection
algorithm is shown in Figure 1. E is the set of candidate news events detected by
the algorithm. Initially, E is empty. For each incoming news document d, the
similarity between d and the most similar detected event in E is compared against a
predefined threshold, called threshold_detect. The similarity is the cosine between
the vectors of the news document and the news event.



Energy-based Event Detection Algorithm:

E = null;
For each news document d from on-line news stream
    e = ARGMAX_{e in E}(sim(e, d));
    If sim(e, d) >= threshold_detect then
        e.EnergyUpdate(d);
        e.VectorUpdate(d);
    else
        e_new = CreateNewsEvent(d);
        add e_new into E;
    end if
end for



Fig. 1. The energy-based event detection algorithm. 

If the similarity is greater than or equal to the threshold, we classify the news
document into the event and update the event's term vector and energy value.
EnergyUpdate() increases the energy value of the corresponding event. Intuitively,
the similarity values between the event and its documents can be summed as the
energy value. However, this unbounded value will cause a hot event, consisting of a
burst of news documents, to have a huge energy value and a longer life span than
the event actually has. To overcome this pitfall, the energy function defined in
Section 3 is used to constrain the energy value. The formula of e.EnergyUpdate(d)
is defined as

$eng_e = F(F^{-1}(eng_e) + \alpha \cdot sim(e, d))$    (15)

where
$eng_e$ is the energy value of the event e;
F() is the energy function;
$F^{-1}()$ is the inverse energy function;
sim(e, d) is the cosine value between the event e and the document d;
$\alpha$ is the energy transferred factor.

In this study, we adopt a sigmoid function as the energy function, which converts
the sum of similarities into a bounded extent. The sigmoid function is defined as:

$F(y) = \frac{10y}{1 + 10y}$ for $y > 0$, and $F(y) = 0$ otherwise.    (16)

One of the distinguishing features of the sigmoid function is that it maps a very
large input domain into a small output domain. As shown in Figure 2, the output
ranges between 0 and 1. Therefore, by using the sigmoid function, energy values can
be limited to between 0 and 1, which is consistent with the definition of the energy
function described in Section 3. Another interesting feature of the sigmoid function
is that it is nonlinear. The curve is much steeper around the origin than at the
extremities. This kind of growth often reflects the development of an event, which
is usually accompanied by a burst of news documents in the beginning and then
gradually fades away. Since the energy value is constrained, we can interpret the
status of the event by partitioning the range of the output of the sigmoid function.
In practice, we can divide the output of the sigmoid function into three parts, each
part representing a different situation of an event. High sigmoid values indicate
hot events, and low sigmoid values indicate that events are out of date.
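A small sketch of the bounded energy update may help. It assumes the energy function reads F(y) = 10y / (1 + 10y) for y > 0, so that its inverse is s / (10(1 - s)); the function names are hypothetical.

```python
def energy(y: float) -> float:
    """ASSUMED energy function: F(y) = 10y / (1 + 10y) for y > 0, else 0."""
    return 10.0 * y / (1.0 + 10.0 * y) if y > 0 else 0.0


def energy_inverse(s: float) -> float:
    """Inverse of `energy` on [0, 1): the net nutrition that yields energy s."""
    if not 0.0 <= s < 1.0:
        raise ValueError("energy values must lie in [0, 1)")
    return s / (10.0 * (1.0 - s))


def energy_update(eng: float, sim: float, alpha: float) -> float:
    """Map the energy back to nutrition, add alpha * sim, and map forward again."""
    return energy(energy_inverse(eng) + alpha * sim)


if __name__ == "__main__":
    eng = 0.0
    # Feed a burst of five equally similar documents and watch the energy saturate.
    for _ in range(5):
        eng = energy_update(eng, sim=0.8, alpha=0.1)
        print(round(eng, 4))
```

Because the update works in the nutrition domain and maps back through F, repeated documents raise the energy quickly at first and then with diminishing increments, so the value stays below 1.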

After increasing the energy value, the function VectorUpdate() is called to capture
the up-to-date event status. We use the above Rocchio formula [12] to update the term
vector of the detected event. That is, we assume the inserted document d is positive
feedback and use it to adjust the term weights of the event.

In case the similarity between the incoming document d and the most similar event
is smaller than threshold_detect, or the set E is empty, the detection algorithm
creates a new event. The function CreateNewsEvent() forms the term vector of the
newly created event by copying it from the input news document.
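The detection loop of Figure 1 can be sketched end to end as below. This is a sketch under stated assumptions, not the authors' implementation: the sigmoid is read as F(y) = 10y / (1 + 10y), the Rocchio-style vector update is assumed to be a convex combination with weight γ on the old event vector, and the seed energy of a new event (as if its own document had been added with similarity 1) is an assumption, since the text only specifies how the term vector is created. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, List

Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def energy(y: float) -> float:
    """ASSUMED energy function F(y) = 10y / (1 + 10y) for y > 0, else 0."""
    return 10.0 * y / (1.0 + 10.0 * y) if y > 0 else 0.0


def energy_inverse(s: float) -> float:
    if not 0.0 <= s < 1.0:
        raise ValueError("energy values must lie in [0, 1)")
    return s / (10.0 * (1.0 - s))


@dataclass
class Event:
    vector: Vector
    eng: float

    def energy_update(self, sim: float, alpha: float) -> None:
        self.eng = energy(energy_inverse(self.eng) + alpha * sim)

    def vector_update(self, doc: Vector, gamma: float) -> None:
        # ASSUMPTION: convex combination of the old event vector and the document.
        terms = set(self.vector) | set(doc)
        self.vector = {
            t: gamma * self.vector.get(t, 0.0) + (1.0 - gamma) * doc.get(t, 0.0)
            for t in terms
        }


def detect_events(stream: Iterable[Vector], threshold_detect: float,
                  alpha: float, gamma: float) -> List[Event]:
    """Single-pass clustering with energy bookkeeping, following Figure 1."""
    events: List[Event] = []
    for doc in stream:
        best = max(events, key=lambda e: cosine(e.vector, doc), default=None)
        sim = cosine(best.vector, doc) if best is not None else 0.0
        if best is not None and sim >= threshold_detect:
            best.energy_update(sim, alpha)
            best.vector_update(doc, gamma)
        else:
            # ASSUMPTION: a new event starts with the energy of one document
            # whose similarity to the copied vector is 1.
            events.append(Event(vector=dict(doc), eng=energy(alpha)))
    return events


if __name__ == "__main__":
    docs = [{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}]
    for event in detect_events(docs, threshold_detect=0.5, alpha=0.1, gamma=0.5):
        print(sorted(event.vector), round(event.eng, 3))
```

In the toy stream above, the first two documents fall into one event and the third seeds a second event; the first event ends with the higher energy because it absorbed an extra document.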




Fig. 2. The graph of the sigmoid function. 



4.3 Energy Decay Algorithm 

In contrast to the energy-based event detection algorithm, which generates new
events, the energy decay algorithm shown in Figure 3 is used to remove antiquated
events. Since events are time dependent, an event detection system is defective
unless it can remove expired events. The energy decay algorithm periodically (e.g.,
every midnight) checks the energy values of events and removes antiquated events to
keep all events in a news reading system up to date.

The energy value of every detected event is periodically reduced by a decay factor
$\beta$, and the value of $\beta$ is calculated by the aging theory. When no or few
documents are added to an event, its energy value will gradually decline. Moreover,
if an event's energy value is no higher than a predefined threshold, called
threshold_decay, we consider the event out of date and remove it from the validated
set E. With the energy-based event detection algorithm and the energy decay
algorithm, the lifespan of a news event is determined by the liveliness of the
event. The more related news documents it has, the longer its lifespan. This makes
life cycle management of news events self-adaptive.



Energy Decay Algorithm:

For each event e in E
    eng_e = eng_e - β;
    if eng_e <= threshold_decay then
        Remove e from E;
    End if
End for



Fig. 3. The energy decay algorithm. 
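A periodic decay pass in the spirit of Figure 3 might look like the sketch below. It assumes the decay subtracts β directly from the stored energy value, which is how we read the figure; the function name, the event identifiers and the sample threshold are hypothetical.

```python
from typing import Dict


def decay_events(energies: Dict[str, float], beta: float,
                 threshold_decay: float) -> Dict[str, float]:
    """Subtract beta from every event's energy and drop expired events.

    ASSUMPTION: the decay is applied directly to the energy value, and an
    event is removed when its energy is at or below `threshold_decay`.
    """
    survivors: Dict[str, float] = {}
    for event_id, eng in energies.items():
        eng -= beta
        if eng > threshold_decay:
            survivors[event_id] = eng
    return survivors


if __name__ == "__main__":
    # Hypothetical energies at midnight; beta is the value reported in Section 5.2.
    current = {"hot-event": 0.9, "stale-event": 0.2}
    print(decay_events(current, beta=0.145198, threshold_decay=0.1))
```

Returning a new dictionary instead of deleting entries in place avoids mutating the collection while iterating over it.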



5 Empirical Evaluations 

Two experiments were designed to evaluate and verify the proposed theory. In the 
first experiment, the training part of our data corpus is used to acquire the optimal 
aging parameters. We use the learned parameters to plot the variation of energy val- 
ues of events in the testing part of the corpus. Some interesting observations from this 
experiment are discussed below. Then, we evaluate the performance of our method 
against three other methods in terms of traditional TDT metrics. The experimental 
results show that our method is more adaptive than others in both long-running and 
short-term events. 



5.1 Data Corpus 

Table 1 details the corpus we built for evaluation by collecting news documents from
several on-line news agencies. We forgo the TDT pilot study corpus [4] because it
does not offer a set of training data from which to obtain the aging parameters.
Moreover, we believe that the aging parameters are category-dependent: each
category will have its own best aging parameters in relation to its events.
Therefore, we compile a corpus that comprises events of all sorts of categories to
serve our purpose. In this study, 18 events in politics are used for evaluations. In
the future, events of other categories will be used as well. In addition, we have
categorized events by type based on the event period. Events were identified as
short-term events if they vanished within three days. In contrast, if the life of an
event lasted over a week, we call the event a long-running event. Categorizing
events by type helps us discuss the strengths of each of the compared methods in
different situations.



Table 1. Statistics of data corpus.

                                 Training Data             Testing Data
Start-end date                   2002/10/1 - 2002/10/31    2002/11/1 - 2002/12/31
Number of news documents         13,267                    30,256
Number of labeled events         8                         10
Event period <= 3 (days)         2                         1
7 >= event period > 3 (days)     2                         3
Event period > 7 (days)          4                         6



5.2 Experiment 1: Growth of Energy Values of News Events 

This experiment inspects the effect of aging parameters on the life cycle of news
events. According to formulas (11) and (12), the values of $\alpha$ and $\beta$ are
determined by the points $(r_1, s_1)$ and $(r_2, s_2)$. We set $r_1$ and $s_1$ as
0.20 and $r_2$ and $s_2$ as 0.85, and the corresponding $t_1$ and $t_2$ are replaced
by the times where the sum of similarity exceeds the $r_1$ and $r_2$ portion of the
sum of similarities between the event and all its documents. The final values of
$\alpha$ and $\beta$, 0.118659 and 0.145198 respectively, are the averages of the
results of all training events.
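The way a crossing time is read off a training event can be sketched as follows. The function name and the sample series are hypothetical, and returning the first time slot at which the cumulative similarity reaches the requested proportion is our interpretation of the procedure.

```python
from itertools import accumulate
from typing import Sequence


def crossing_time(x: Sequence[float], r: float) -> int:
    """Return the first 1-based time slot whose cumulative similarity
    reaches the proportion r of the event's total similarity."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must lie in (0, 1]")
    cumulative = list(accumulate(x))
    if not cumulative or cumulative[-1] <= 0:
        raise ValueError("the event needs a positive total similarity")
    target = r * cumulative[-1]
    for slot, value in enumerate(cumulative, start=1):
        if value >= target:
            return slot
    return len(cumulative)  # not reached in practice because r <= 1


if __name__ == "__main__":
    daily_similarity = [1.0, 2.0, 3.0, 4.0]  # hypothetical training event
    t1 = crossing_time(daily_similarity, 0.20)
    t2 = crossing_time(daily_similarity, 0.85)
    print(t1, t2)
```

The two slots obtained this way play the roles of $t_1$ and $t_2$ in formulas (11) and (12), and averaging the resulting parameters over all training events gives the final values.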




56 Chien Chin Chen et al.

Figure 4 shows the energy values and numbers of news documents per day of
some of the testing events. We chose these events because they are difficult cases
for event detection. The event on the left began with very few related news
documents, and it immediately went quiet for a while until some follow-up news
occurred. During the quiet period, it is hard for an event detection system to
identify whether the event has ended, especially if it had a weak beginning. An
early death announcement will cause serious errors since there were plenty of
follow-ups. However, holding all weak-beginning events will overly emphasize the
importance of weak events, which may cause false alarms. Fortunately, our aging
theory can tackle this kind of dilemma. Since the sigmoid function is nonlinear, the
steeper slope near the origin gives an event a higher vitality even if that event is
inactive at an early stage. As a result, the high initial energy value helps the
event survive the quiet period and track the follow-ups. What if the event is indeed
a weak event, such as Event 15 in Figure 4? In this case, our decay mechanism
eliminates the event after a short period of time. As shown in Figure 4, this event
only lasted for two more days. Observing the contrast between the events in
Figure 4, we found that our aging theory can ascertain the rise and fall of an
event's progress.



5.3 Experiment 2: Event Detection Comparisons 

In this experiment, our aging method (A) is compared to three previously proposed
methods. The baseline method (B) [4] is a basic single-pass clustering algorithm.
The time-based threshold method (T) [3] and the time window method (W) [14] enhance
the single-pass clustering algorithm with temporal information. Each of the above
methods groups the testing part of the corpus into several clusters. For each
method, the ten best-matched clusters were chosen for performance comparison. The
degree of match between a testing event and a generated cluster is determined by the
number of news documents belonging to both the testing event and the cluster. As a
result, 10 generated clusters from each method are evaluated using six official TDT
measures [4]: precision (p), recall (r), miss (m), false alarm (f), F1-measure
(F1) and cost (c).
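For one event and its best-matched cluster, five of these measures follow from a two-by-two contingency table. The sketch below uses the usual definitions; the function name and the sample counts are hypothetical, and the cost measure is left out because its weighting constants are not given in this section.

```python
from typing import Dict


def detection_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Compute p, r, m, f and F1 from contingency counts.

    a: documents in both the cluster and the event
    b: documents in the cluster but not in the event
    c: documents in the event but not in the cluster
    d: documents in neither
    """
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    miss = c / (a + c) if a + c else 0.0
    false_alarm = b / (b + d) if b + d else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"p": precision, "r": recall, "m": miss, "f": false_alarm, "F1": f1}


if __name__ == "__main__":
    # Hypothetical counts for a single event and its best-matched cluster.
    print(detection_metrics(a=8, b=2, c=2, d=988))
```

Because most documents in a large corpus belong to neither the cluster nor the event, d dominates and false alarm rates come out very small, which is consistent with the magnitudes in the f column of the result tables.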

Table 2 shows the results of the experiments, in which T0.02, T0.05, and T0.1 are
time-based threshold methods with 0.02, 0.05 and 0.1 time penalty [3] respectively.
W2000, W2000d, and W3000d are the time window methods with window sizes [14]
of 2000, 2000 and 3000 respectively. The lowercase letter d indicates that the
window is decaying-based [14].



Table 2. Experiment results on testing events.

          p      r      m      f         F1     c
B         0.68   0.75   0.25   0.0007    0.63   0.006
A         0.73   0.75   0.24   0.0002    0.72   0.005
T0.02     0.88   0.58   0.41   0.00008   0.67   0.008
T0.05     0.87   0.57   0.42   0.00008   0.67   0.008
T0.1      0.93   0.47   0.52   0.00003   0.61   0.01
W2000     0.56   0.89   0.1    0.002     0.62   0.004
W2000d    0.76   0.67   0.32   0.0002    0.65   0.006
W3000d    0.65   0.8    0.19   0.0006    0.66   0.004

In Table 2, nearly all temporal-based methods outperform the baseline method. Our
aging method achieves both reasonable precision and recall, which results in the
best F1 score, while the time-based threshold method achieves good precision but
loses recall. The time window method has good recall but decreased precision. Even
though the lowest cost comes from the time window method, we believe that this is
due to the fact that the majority of testing events are long-running. If we
separately compare the time window method with our aging method on a short-term
event, as shown in Table 3, we find that the time window method is not suitable for
short-term events. During the detection process on a short-term event, the fixed
window size overly emphasizes the influence of the short-term event, so that many
context-similar but event-different news stories are merged into the event, which
results in low precision scores. As indicated in Figure 5, the last peak of the
curve of the window2000 method is a mis-merged event. Our aging method, on the
other hand, lets the event control its own lifespan; consequently it outperforms the
time window method for short-term events.



Table 3. Experiment results on short-term event.

          p      r      m      f        F1     c
A         0.61   0.79   0.2    0.0003   0.69   0.004
W2000     0.12   0.91   0.08   0.005    0.21   0.006
W2000d    0.37   0.83   0.16   0.001    0.51   0.004
W3000d    0.22   0.83   0.16   0.002    0.34   0.005



[Figure 5: day-by-day detection results on a short-term event. Horizontal axis: day
(6 to 17). Series: true data, aging, window2000, window2000d, window3000d.]

Fig. 5. Event detection results on a short-term event.

The time-based threshold method sacrifices its recall to achieve high precision,
especially on long-running events (as shown in Table 4). The growing threshold keeps
somewhat context-similar, but event-different, news stories from being included in a
detected event, and thus results in a substantial number of clusters. However,
continually increasing the threshold may break the storylines of a long-running
event into pieces, which consequently results in a high miss rate. As we can see
from Figure 6, the long-running event was fragmented into twelve clusters when using
the time-based threshold method, while our aging method merely broke the event into
one major and three trivial parts.



Table 4. Experiment results on 6 long-running events.

          p      r      m      f         F1     c
A         0.72   0.68   0.31   0.0002    0.67   0.006
T0.02     0.88   0.46   0.53   0.00005   0.60   0.01
T0.05     0.87   0.51   0.49   0.00007   0.61   0.009
T0.1      0.94   0.37   0.62   0.00002   0.52   0.012




[Figure 6: two pie charts of cluster shares. First panel: long-running event against
aging method (cluster1 to cluster4; labeled share 88%). Second panel: long-running
event against time-based threshold (0.1) method (cluster1 to cluster12; labeled
shares 7%, 2%, 44%).]

Fig. 6. Aging method on a long-running event.



Experimental results show that the time window method scores high on long-running
events but performs poorly on short-term events. The time-based threshold method has
a high precision rate as well as a high miss rate on long-running events. Our aging
method has fairly good performance on both long-running and short-term events.



6 Conclusions 

In this study, the aging theory is incorporated into the traditional single-pass
clustering algorithm to detect and track news events. The growth of the energy value
synchronizes well with the event progress. Moreover, experimental results show that
the proposed method performs well for both long-running and short-term events in
comparison to other approaches. We are now applying our method to other categories,
such as finance, sports and entertainment. We believe that using category-specific
aging parameters achieves the best results for event detection in all possible
categories.

Abstract. We consider inference of automata from given data. A classical
problem is to find the smallest compatible automaton, i.e. the
smallest automaton accepting all examples and rejecting all counter-examples.
We study unambiguous automata (UFA) inference, an intermediate
framework between the hard nondeterministic automata (NFA)
inference and the well-known deterministic automata (DFA) inference.
The search space for UFA inference is described and original theoretical
results on both the DFA and the UFA inference search space are given.
An algorithm for UFA inference is proposed and experimental results on
a benchmark with both deterministic and nondeterministic targets are
provided, showing that UFA inference outperforms DFA inference.



Introduction 

Motivations: We consider inference of nondeterministic automata (NFA) from
given data. A classical problem is to find the smallest compatible automaton,
i.e. the smallest automaton accepting all examples and rejecting all
counter-examples. When automata are deterministic (DFA), the problem has been
extensively studied and is NP-complete [Gol78,PW89]. However, if enough examples
and counter-examples are provided, polynomial inference algorithms using
state-merging methods perform well [OG92,Lan92,LPP98].

NFA inference is known to be harder than DFA inference [Hig97]. But, in the
Occam’s razor paradigm, it is worth noticing that NFA may be exponentially
smaller than DFA. NFA also represent some structures - like “gaps” in genomics -
more explicitly than DFA, and are therefore better suited to interpretation by
an expert of the application domain. Experimental results of [GF00,DLT01] show
that inferring regular languages using classes of automata containing
nondeterministic representations is a promising approach.

Nevertheless, the full complexity of NFA is not necessary to take advantage
of nondeterminism. We propose to study the inference of an intermediate class
of automata, the unambiguous automata (UFA). As we will show in this article,
inferring UFA makes it possible to introduce a reasonable amount of
nondeterminism while keeping some advantages of the DFA representation.
To tackle UFA inference, we consider this problem as a search for a particular
UFA in a space of NFA. We propose to adapt state-merging methods - which
have proven successful for DFA inference - to realize UFA inference. We first
describe the search space for NFA inference in the state-merging framework by
revisiting results of [DMV94] (section 1). Then, we propose operators that
explore this search space by considering only unambiguous automata (section
2). Thanks to the operators defined in section 2, different strategies for
exploring the search space can be applied. We have implemented a greedy strategy
together with a heuristic inspired by classical DFA inference algorithms. This
algorithm is shown to perform better on a benchmark of the domain than the
original DFA algorithm. A comparison with the DeLeTe2 algorithm, which infers
residual finite state automata (RFSA) [DLT01], is also given; it shows that each
algorithm is better suited to different subparts of the benchmark.

Definitions and Notations: We denote by $|E|$ the cardinality of a set $E$. A
partition of a set $E$ is a set of subsets of $E$ such that the intersection of
each pair of subsets is empty and the union of all subsets is $E$. An element of
a partition is called a block. Let $\Sigma$ be a finite alphabet; we denote by
$\Sigma^*$ the set of words on $\Sigma$, by $\epsilon$ the empty word and by
$|u|$ the length of a word $u$ of $\Sigma^*$.

Definition 1. A nondeterministic automaton, or NFA, is a 5-tuple
$(\Sigma, Q, I, \delta, F)$ where $\Sigma$ is the input alphabet, $Q$ is a finite
set of states, $I \subseteq Q$ is the set of initial states, $\delta$ is the
transition mapping defined from $Q \times \Sigma$ to $2^Q$, and $F \subseteq Q$
is the set of final states. The $\delta$ function is classically extended to
words by: $\forall q \in Q, \forall a \in \Sigma, \forall w \in \Sigma^*$,
$\delta(q, \epsilon) = \{q\}$ and
$\delta(q, aw) = \bigcup_{q' \in \delta(q, a)} \delta(q', w)$. A tuple
$(q, a, q')$ with $q' \in \delta(q, a)$ is called a transition.

The regular language recognized by an automaton $A$ is
$L(A) = \{w \in \Sigma^* \mid \exists q_i \in I, \delta(q_i, w) \cap F \neq \emptyset\}$.
We associate two languages to each state $q$ of an automaton: its prefix
language, which is the set of words $w$ such that $q \in \delta(I, w)$ (with
$\delta(I, w) = \bigcup_{q_i \in I} \delta(q_i, w)$), and its suffix language,
which is the set of words $w$ such that $\delta(q, w) \cap F \neq \emptyset$. NFA
are considered trimmed (i.e. no state has an empty prefix or suffix language).
The size of a NFA $A$ is defined as its number of states.

A deterministic finite automaton, or DFA, is a NFA $(\Sigma, Q, I, \delta, F)$
such that $|I| = 1$ and $\forall q \in Q, \forall a \in \Sigma, |\delta(q, a)| \leq 1$.
Some particular DFA can be defined. The canonical automaton of the regular
language $L$, denoted by $A(L)$, is the unique minimal DFA accepting $L$. The
universal automaton, $UA(\Sigma)$ or more simply $UA$, is the canonical
automaton $A(\Sigma^*)$ accepting all words on $\Sigma$ (figure 2).

An acceptance for a word $w \in \Sigma^*$ - with $w = a_1 \ldots a_{|w|}$ - in an
automaton $A = (\Sigma, Q, I, \delta, F)$ is a sequence $(q_0, \ldots, q_{|w|})$
of $|w| + 1$ states such that $q_0 \in I$,
$\forall i \in [1, |w|], q_i \in \delta(q_{i-1}, a_i)$, and $q_{|w|} \in F$.
Transitions $(q_{i-1}, a_i, q_i)$ are said to be reached by the acceptance. The
ambiguity degree of an automaton $A$ is the maximum number of acceptances that
exist in $A$ for a word of $\Sigma^*$. An unambiguous finite automaton, or UFA,
is a NFA with an ambiguity degree of at most one (figure 1). When a NFA is not
a UFA, we say it is ambiguous.
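The definitions of acceptance and ambiguity can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the class name `NFA`, the dictionary encoding of $\delta$ and the two example automata are hypothetical and do not come from the paper. It counts the acceptances of a word by dynamic programming over the extended transition function.

```python
from itertools import product


class NFA:
    """NFA (Sigma, Q, I, delta, F); delta maps (state, letter) to a set of states."""

    def __init__(self, alphabet, states, initial, delta, final):
        self.alphabet = set(alphabet)
        self.states = set(states)
        self.initial = set(initial)
        self.delta = {key: set(targets) for key, targets in delta.items()}
        self.final = set(final)

    def count_acceptances(self, word):
        """Number of acceptances (q_0, ..., q_|w|) of `word` in the automaton."""
        # paths[q] = number of state sequences from an initial state to q
        # that read the prefix of `word` consumed so far.
        paths = {q: 1 for q in self.initial}
        for letter in word:
            following = {}
            for q, count in paths.items():
                for r in self.delta.get((q, letter), ()):
                    following[r] = following.get(r, 0) + count
            paths = following
        return sum(count for q, count in paths.items() if q in self.final)

    def accepts(self, word):
        return self.count_acceptances(word) > 0


def max_acceptances(nfa, max_len):
    """Largest number of acceptances over words of length <= max_len.

    Bounded check only: the result is a lower bound on the ambiguity degree.
    """
    return max(
        nfa.count_acceptances(word)
        for n in range(max_len + 1)
        for word in product(sorted(nfa.alphabet), repeat=n)
    )


# Hypothetical three-state automaton for Sigma* a Sigma over {a, b}.
ufa = NFA("ab", {0, 1, 2}, {0},
          {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}, {2})

# Same automaton with a loop added on the final state: now ambiguous.
amb = NFA("ab", {0, 1, 2}, {0},
          {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2},
           (2, "a"): {2}, (2, "b"): {2}}, {2})

print(ufa.count_acceptances("aab"), amb.count_acceptances("aab"))
```

The first automaton is nondeterministic but has a single acceptance for every accepted word, whereas the second has two acceptances for `aab`. Note that `max_acceptances` enumerates words up to a length bound, so it can exhibit ambiguity but cannot prove its absence.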
Fig. 1. An example of UFA, representing the language $\Sigma^* a \Sigma$



Notice that the class of DFA is included in the class of UFA. DFA and UFA 
are obviously included in the class of NFA. NFA, UFA and DFA can represent 
any regular language. 

This document includes theorems for which only hints of proofs are provided; 
for complete proofs the reader can consult [CF03]. 

1 Search Space for Automata Inference 

The search space we want to explore is the restriction to UFA of the search 
space for NFA inference by means of state-merging methods. This first section 
presents and revisits the NFA search space described by [DMV94,Dup96]. The next
section studies its restriction to UFA.

In the framework of inference from given data [Gol78], we try to infer
languages from a training sample. In this paper, we define a training sample of a
language $L$ to be a pair of finite sets $(S_+, S_-)$, where $S_+ \subseteq L$ is
called the positive training sample and $S_- \subseteq \Sigma^* \setminus L$ is
called the negative training sample. For the sake of clarity we consider only the
positive training sample in sections 1 and 2. However, the results of these
sections can easily be extended to unbiased inference [AS95,Cos99], i.e. to
considering the two parts of the sample symmetrically.

An underlying assumption for inference of an automaton is that the positive 
training sample is “representative enough” of the language to learn. This can be 
formalized by the notion of structural completeness which intuitively means that 
all constituents of the target automaton are useful for recognizing the sample. More
formally: 

Definition 2. A positive training sample $S_+$ is said to be structurally complete
with respect to an automaton $A$ iff there exists an acceptance set $\mathcal{A}$
containing exactly one acceptance for each word of $S_+$ such that:

— every transition of $A$ is reached by an acceptance of $\mathcal{A}$,

— every initial state of $A$ is the first state of an acceptance of $\mathcal{A}$,

— every final state of $A$ is the last state of an acceptance of $\mathcal{A}$.

Structural completeness hypothesis enables one to restrict the search space 
to a finite ordered set^ of automata with a top and a bottom element. The top 
element of this set is the universal automaton and the bottom element is the 
Maximal Canonical Automaton (figure 2). 



^ Vocabulary of this paper concerning ordered sets is taken from [DP90]. 
Definition 3. The maximal canonical automaton with respect to a positive sample
$S_+ = \{w_1, \ldots, w_{|S_+|}\}$, denoted by $MCA(S_+)$ or more simply $MCA$,
is the union of the canonical automata $A(\{w_i\})$ for each word of the sample
($i \in [1, |S_+|]$).

The MCA realizes a learning by rote of the positive sample. Inference in the
state-merging framework consists in generalizing the language recognized by the
MCA by merging its states (or unifying them, see [DMV94] for a constructive
definition). Given an automaton $A$ and a partition $\pi$ on the states of $A$,
we can construct an automaton $A/\pi$ by merging the states of $A$ that are in
the same block of the partition. We say that $A/\pi$ is derived from $A$ with
respect to partition $\pi$.
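As an illustration of the MCA and of the derived automaton $A/\pi$, here is a minimal Python sketch. The function names (`mca`, `quotient`, `merge_blocks`, `accepts`) and the tuple encoding of automata are assumptions made for this example; they are not the constructive definition of [DMV94]. The sample is the one used in figure 2.

```python
def mca(sample):
    """Maximal canonical automaton: one chain of states (i, j) per word w_i."""
    states, initial, final, delta = set(), set(), set(), {}
    for i, word in enumerate(sample):
        states.update((i, j) for j in range(len(word) + 1))
        initial.add((i, 0))
        final.add((i, len(word)))
        for j, letter in enumerate(word):
            delta.setdefault(((i, j), letter), set()).add((i, j + 1))
    return states, initial, final, delta


def quotient(automaton, partition):
    """Automaton A/pi: every block of the partition becomes a single state."""
    states, initial, final, delta = automaton
    blocks = [frozenset(b) for b in partition]
    block_of = {q: b for b in blocks for q in b}
    if set(block_of) != states or sum(len(b) for b in blocks) != len(states):
        raise ValueError("not a partition of the state set")
    q_delta = {}
    for (q, letter), targets in delta.items():
        for r in targets:
            q_delta.setdefault((block_of[q], letter), set()).add(block_of[r])
    return (set(blocks),
            {block_of[q] for q in initial},
            {block_of[q] for q in final},
            q_delta)


def merge_blocks(states, block):
    """Partition made of `block` plus one singleton per remaining state."""
    block = frozenset(block)
    return [block] + [frozenset([q]) for q in states if q not in block]


def accepts(automaton, word):
    states, initial, final, delta = automaton
    current = set(initial)
    for letter in word:
        current = {r for q in current for r in delta.get((q, letter), ())}
    return bool(current & final)


m = mca(["aaa", "bba", "baaa"])                      # sample of figure 2
g = quotient(m, merge_blocks(m[0], [(0, 1), (0, 2)]))  # merge two states of "aaa"
u = quotient(m, [m[0]])                              # a single block

print(accepts(m, "aa"), accepts(g, "aa"), accepts(u, "abab"))
```

Merging two states of the chain for `aaa` creates a loop, so the derived automaton generalizes the sample; putting all states in one block yields an automaton accepting every word, as the universal automaton does.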

We denote the set of partitions on the states of MCA by $P(MCA)$. An order
on the partitions of $P(MCA)$ can be defined as follows: we say that a partition
$\pi_2$ directly derives from partition $\pi_1$, denoted by $\pi_1 \prec \pi_2$,
if $\pi_2$ can be constructed from $\pi_1$ as follows:
$\exists b_1, b_2 \in \pi_1, b_1 \neq b_2, \pi_2 = (\pi_1 \setminus \{b_1, b_2\}) \cup \{b_1 \cup b_2\}$.
The transitive closure of $\prec$ is denoted by $\preceq$. $P(MCA)$ is a complete
lattice of partitions under the order relation $\preceq$.

The relation $\preceq$ between partitions is extended to the relation
$\preceq_a$ between automata as follows:
$A_1 \prec_a A_2 \iff \exists \pi_1, \pi_2 \in P(MCA), \pi_1 \prec \pi_2, A_1 = MCA/\pi_1, A_2 = MCA/\pi_2$.
The transitive closure of $\prec_a$, denoted by $\preceq_a$, defines an order
relation on automata. An automaton $A'$ such that $A \preceq_a A'$ is said to be
derivable from $A$.

Let $A_{NFA}(MCA)$, or more simply $A(MCA)$, denote the set of NFA derivable
from MCA. In the following sections, we extend this notation to any class
of automata and any automaton; for example $A_{DFA}(A)$ will denote the set of
deterministic automata derived from automaton $A$. The following theorem holds
(illustrated by figure 2).

Theorem 11 The search space for NFA under the hypothesis of structural
completeness of a positive training sample $S_+$ is $A(MCA(S_+))$.

Hint of the proof: The proof is an extension of the proof provided by [DMV94], taking
into account our more precise definition of structural completeness and NFA having
more than one initial state. □




Fig. 2. Universal automaton (UA), maximal canonical automaton (MCA) for
$S_+ = \{aaa, bba, baaa\}$, $P(MCA)$ and $A(MCA)$.
Fig. 3. From $P(MCA)$ being a lattice we cannot deduce that $A(MCA)$ is also a
lattice, because an automaton of $A(MCA)$ can be derived from more than one
partition of $P(MCA)$. The figure illustrates this point by exhibiting a pair of
automata - the two on top - without greatest lower bound under the order relation
(the relation is represented by arrows).



Let us remark that, as illustrated by figure 3, even if $P(MCA)$ is a lattice
of partitions, $A(MCA)$ is not a lattice of automata under the order relation.
This clearly shows that the term "lattice" is misused in the regular grammatical
inference community.



2 Search Space for UFA Inference 

2.1 From DFA to UFA

The inference search space has often been restricted to DFA, and NFA inference 
can be considered to be harder than DFA inference (indeed, NFA do not have a 
canonical form and are not polynomially learnable from given data whereas DFA 
are [Hig97]). We show, in this section, that properties known for the restriction 
of the search space to DFA are also valid for the restriction of the search space 
to UFA. 

The bottom element of the search space of DFA is the prefix tree acceptor,
denoted by $PTA(S_+)$ or more simply PTA [DMV94], which is obtained by
determinisation of the MCA. For UFA, the MCA being unambiguous, the bottom element
remains the MCA, as in the search space of NFA.

A first link between DFA and UFA search space is given by theorem 21: 

Theorem 21 Let $A$ be a UFA and $S_+$ a positive training sample structurally
complete with respect to $A$. There exists one and only one partition $\pi$ in
$P(MCA)$ such that $A = MCA/\pi$.

Hint of the proof: There exists a partition $\pi$ such that $A = MCA/\pi$ (entailed by
UFA $\subseteq$ NFA and theorem 11). We have to show that this partition is unique.

For each word $w \in S_+$, $A$ being a UFA, there exists only one acceptance $acc_1$ for
this word in $A$. The acceptance $acc_2$ for $w$ in MCA defines a mapping from
states of MCA to states of $A$, every $i$th state of $acc_2$ being mapped to the $i$th
state of $acc_1$. This mapping defines for every state of MCA the unique block of the
partition it can be in, and therefore the unique possible partition. □

This property was known for DFA, and theorem 21 places it in the more
general framework of UFA. From theorem 21, and as illustrated by figure 4, UFA
have the advantage over NFA of being represented by only one partition. DFA
have the advantage over NFA and UFA of having a canonical form.
Fig. 4. The figure shows partitions of minimum size 2 representing the same
language $L$ in $P(MCA)$ with $S_+ = \{aaaaaa\}$. When looking at the automata
derived from these partitions, we count a unique DFA, two UFA and 7 NFA, 5 being
isomorphic.



To explore the search space of UFA, we could consider only state-mergings
from the MCA leading to other UFA. Indeed, we show in [CF03] - extending a
theorem of [Dup96] for DFA - that all UFA of the search space can be reached
from the MCA by a sequence of merges passing only through UFA.

Nevertheless, we focus in this paper on another state-merging operator for 
UFA inference called unambiguous merging. This operator can be considered as 
the counterpart of the deterministic merging operator which has been extensively 
used in DFA inference algorithms (e.g. [OG92,LPP98]). 

2.2 From Deterministic Merging to Unambiguous Merging

The deterministic merging operator is based on a procedure called merging for 
determinisation. After introducing a few definitions and a property, we present 
the dual merging for disambiguisation procedure and then the unambiguous 
merging operator^. 

Two states $q_1$ and $q_2$ are said to be in common prefix relation (resp. in
common suffix relation) if the intersection of their prefix languages (resp. their
suffix languages) is not empty. Two states $q_1$ and $q_2$ simultaneously in common
prefix relation and in common suffix relation are said to be in parallel acceptance
relation, denoted by $q_1 \parallel q_2$.

Property 21 An automaton $A = (\Sigma, Q, I, \delta, F)$ is ambiguous iff it has two
different states in parallel acceptance relation.

Hint of the proof: For every pair of different states $(q_1, q_2)$ in parallel acceptance
relation, there exists a word $u$ common to their prefix languages and a word $v$ common
to their suffix languages. This is equivalent to the existence of two acceptances for the
word $uv$, the first reaching $q_1$ and the second $q_2$ by the word $u$. □

The sets of common prefix and common suffix relations can be computed
and incrementally maintained after each merge [CF00]. The common suffix relation
is presented here for the first time but can be maintained exactly like the
incompatibility relation presented in [CF00]. The parallel acceptance relation is
directly deduced from the two previous relations.

By using these relations, we can now define the merging for disambiguisation
procedure (algorithm 1). This procedure consists in merging pairs of states in
parallel acceptance relation. Each merge possibly entailing new relations, the
procedure stops merging when no more pairs of states are in parallel acceptance
relation.

Algorithm 1 Merging for disambiguisation of $A = (\Sigma, Q, I, \delta, F)$

while $\exists q_1, q_2 \in Q, q_1 \parallel q_2, q_1 \neq q_2$ do
    $A \leftarrow merge(A, q_1, q_2)$

^ Formal properties of the deterministic merging operator have never been formalized;
this section provides both its extension to the unambiguous case and a formalization
of this operator's properties.

Compared to merging for determinisation, which can be defined as merging 
of all states in common prefix relation, merging for disambiguisation merges all 
states both in common prefix relation and common suffix relation. Therefore 
merging for disambiguisation realizes only a subset of the merging needed by 
merging for determinisation and allows a finer exploration of the search space. 
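The merging for disambiguisation procedure can be sketched in Python as follows. This is a naive illustration, not the incremental computation of [CF00]: the common prefix and common suffix relations are recomputed from scratch after every merge, as reachability and co-reachability in the product of the automaton with itself. The function names, the tuple encoding `(states, initial, final, delta)` and the example automaton are assumptions made for this sketch.

```python
def parallel_pairs(aut):
    """Pairs of distinct states in parallel acceptance relation."""
    states, initial, final, delta = aut
    letters = {letter for (_, letter) in delta}

    def successors(p, q):
        # Pairs reached from (p, q) by reading the same letter on both sides.
        for letter in letters:
            for p2 in delta.get((p, letter), ()):
                for q2 in delta.get((q, letter), ()):
                    yield (p2, q2)

    # Common prefix relation: pairs reachable from I x I in the product.
    prefix = {(p, q) for p in initial for q in initial}
    todo = list(prefix)
    while todo:
        p, q = todo.pop()
        for pair in successors(p, q):
            if pair not in prefix:
                prefix.add(pair)
                todo.append(pair)

    # Common suffix relation: pairs from which F x F is reachable (fixpoint).
    suffix = {(p, q) for p in final for q in final}
    changed = True
    while changed:
        changed = False
        for p in states:
            for q in states:
                if (p, q) not in suffix and any(
                        s in suffix for s in successors(p, q)):
                    suffix.add((p, q))
                    changed = True

    return {(p, q) for (p, q) in prefix & suffix if p != q}


def merge(aut, keep, drop):
    """Merge state `drop` into state `keep`."""
    states, initial, final, delta = aut

    def rename(q):
        return keep if q == drop else q

    new_delta = {}
    for (q, letter), targets in delta.items():
        new_delta.setdefault((rename(q), letter), set()).update(
            rename(r) for r in targets)
    return ({rename(q) for q in states}, {rename(q) for q in initial},
            {rename(q) for q in final}, new_delta)


def disambiguate(aut):
    """Algorithm 1: merge states in parallel acceptance relation until none is left."""
    while True:
        pairs = parallel_pairs(aut)
        if not pairs:
            return aut
        q1, q2 = next(iter(pairs))
        aut = merge(aut, q1, q2)


# Hypothetical ambiguous automaton: "ab" has the acceptances 0,1,3 and 0,2,4.
amb = ({0, 1, 2, 3, 4}, {0}, {3, 4},
       {(0, "a"): {1, 2}, (1, "b"): {3}, (2, "b"): {4}, (2, "c"): {4}})
result = disambiguate(amb)
print(len(result[0]), parallel_pairs(result))
```

On this example the pairs (1, 2) and (3, 4) are in parallel acceptance relation, and the procedure ends with a three-state automaton whichever pair is merged first. States that only share a prefix, such as two states reached by `a` but continuing with different letters, would be left unmerged, which is the difference with merging for determinisation.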

Merging for disambiguisation (resp. merging for determinisation) does all 
necessary and sufficient merging to reach the “closest” UFA (resp. DFA) derived 
from a NFA. We formalize this fact for UFA by property 22: 

Property 22 Let $A$ be a NFA, and $A'$ the UFA obtained by merging for
disambiguisation of $A$. Then every UFA of $A_{UFA}(A)$ is in $A_{UFA}(A')$.

Hint of the proof: Let $A = A_1, A_2, \ldots, A_n = A'$ be the sequence of automata created
by the merging for disambiguisation procedure. From property 21 we can show that there
is no UFA in $A_{UFA}(A_i) - A_{UFA}(A_{i+1})$. The property can then be proven by induction
on $i \in [1, n[$. □

Let us remark that property 22 entails that, whatever the order of the mergings
performed by merging for disambiguisation (or merging for determinisation), these
mergings always lead to the same automaton.

We now introduce the operator of unambiguous merging (resp. deterministic
merging). Unambiguous (resp. deterministic) merging consists in merging
two states of an automaton and applying merging for disambiguisation (resp.
determinisation) to the resulting automaton.

We write $A_1 \prec_{dis} A_2$ (resp. $A_1 \prec_{det} A_2$) if automaton $A_2$ can
be obtained by applying one unambiguous (resp. deterministic) merging on $A_1$.
Relations $\preceq_{dis}$ and $\preceq_{det}$ denote respectively the transitive
closure of $\prec_{dis}$ and $\prec_{det}$.

As shown by theorem 22, every UFA derived from a given UFA $A$ - i.e.
automata of $A_{UFA}(A)$ - can be reached by a sequence of unambiguous mergings
from $A$. More formally:
$\forall A_1, A_2 \in UFA, A_1 \in A(A_2) \Rightarrow A_2 \preceq_{dis} A_1$.

Theorem 22 Let $A$ be a UFA. For every UFA $A'$ of $A_{UFA}(A)$, there exists a
sequence of automata $A_0, \ldots, A_n$ such that
$A_0 \prec_{dis} A_1 \prec_{dis} \ldots \prec_{dis} A_n$ and $A = A_0$, $A' = A_n$.

Hint of the proof: This can be proven as a consequence of property 22. □

The counterpart of this theorem for DFA and deterministic merging is also true,
i.e.: $\forall A_1, A_2 \in DFA, A_1 \in A(A_2) \Rightarrow A_2 \preceq_{det} A_1$.

Section 3 presents the use of the operator of unambiguous merging to explore 
the space of UFA. 
3 Experimental Comparison

3.1 Algorithms and Benchmarks 

Section 2 detailed both the search space for UFA and the operators available to
explore it. Different strategies can be applied when using these operators. Our
experimental results are based on a greedy search - presented by algorithm 2 -
which is the classical approach applied for DFA inference (e.g. [OG92,LPP98]).
The choose-two-states method of this algorithm represents the heuristic, i.e.
the order used to try state-mergings. The state-merging method depends on
which class of automata is inferred: we use deterministic merging for DFA
inference and unambiguous merging for UFA inference.

Algorithm 2 Principle of greedy state-merging algorithms.
Function greedy-state-merging-algorithm($S = (S_+, S_-)$)
    $A \leftarrow MCA(S_+)$ (or $A \leftarrow PTA(S_+)$ for DFA inference)
    while choose-two-states($A$, $q_1$, $q_2$) do
        $A' \leftarrow$ state-merging($A$, $q_1$, $q_2$)
        if $A'$ is compatible with $S_-$ then $A \leftarrow A'$
    return $A$
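The control flow of algorithm 2 can be sketched as a generic Python function whose three ingredients (heuristic, merging operator, compatibility test) are passed as parameters. This is only one possible reading of the pseudocode: the bookkeeping of rejected pairs is an assumption added to guarantee termination, and the toy instantiation below, which works on a bare partition of five states with a made-up compatibility rule, merely stands in for real automata and a real negative sample.

```python
def greedy_state_merging(start, choose_two_states, state_merging, is_compatible):
    """Greedy loop: try candidate merges in heuristic order, keep compatible ones."""
    current = start
    rejected = set()  # pairs already tried on `current` and found incompatible
    while True:
        pair = choose_two_states(current, rejected)
        if pair is None:
            return current
        candidate = state_merging(current, *pair)
        if is_compatible(candidate):
            current = candidate
            rejected = set()  # the hypothesis changed: old pairs no longer apply
        else:
            rejected.add(pair)


# Toy instantiation (hypothetical): the hypothesis is a partition of {0, ..., 4},
# a merge is the union of two blocks, and a partition is "compatible" as long as
# states 0 and 4 stay in different blocks.
def choose_first_untried(partition, rejected):
    blocks = sorted(partition, key=min)
    for i, b1 in enumerate(blocks):
        for b2 in blocks[i + 1:]:
            if (b1, b2) not in rejected:
                return (b1, b2)
    return None


def union_blocks(partition, b1, b2):
    return (partition - {b1, b2}) | {b1 | b2}


def keeps_0_and_4_apart(partition):
    return not any({0, 4} <= block for block in partition)


start = frozenset(frozenset([q]) for q in range(5))
result = greedy_state_merging(start, choose_first_untried,
                              union_blocks, keeps_0_and_4_apart)
print(sorted(sorted(block) for block in result))
```

In an actual inference algorithm, `state_merging` would be the deterministic or unambiguous merging operator and `is_compatible` would check that no word of $S_-$ is accepted.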

We compared the best heuristic known for DFA inference, called EDSM
[LPP98], a hill-climbing strategy for UFA (detailed in subsection 3.2) and
inference of RFSA (Residual Finite State Automata) with the DeLeTe2 algorithm
[DLT01]. The experimental comparison of DFA, UFA and RFSA inference is
based on benchmarks provided in [DLT00, DLT01]. These benchmarks contain
training and testing sets for languages generated using different methods:
construction of random DFA, random NFA and random regular expressions. We
added to this benchmark languages generated by a UFA generator.

The UFA generator takes five parameters: a number of states $N$, a probability
$p_i$ for a state to be initial, a probability $p_f$ to be final, an alphabet
$\Sigma$ and a number of transitions $t$. After constructing the $N$ states of the
generated automaton $A$, each state is set initial with probability $p_i$; then
each state is set final with probability $p_f$, except if this would make $A$
ambiguous; and then we try $t$ times to insert a new transition $(q_1, a, q_2)$
between states of $A$ ($q_1$, $a$ and $q_2$ being chosen uniformly in
$Q \times \Sigma \times Q$); the insertion is rejected if it would make $A$
ambiguous. UFA of the benchmark have been generated with parameters $N = 20$,
$p_i = 0.3$, $p_f = 0.3$, $t = 60$ and $\Sigma = \{0, 1\}$.

Training and testing samples are generated with the method used in [DLT01]:
for each word $w$ of the sample, its length is chosen uniformly in ]0, 29] and $w$ is
chosen uniformly among the words of this length; $w$ is labeled by ‘+’ if it is in the
generated language and by ‘-’ otherwise. 30 languages are generated for each size
of training sample (50, 100, 150 or 200). The generated language is kept only if
the corresponding training sample contains at most 80%, and at least 20%, of
words labeled by ‘+’. The testing sample of each language contains 1000 examples
and counter-examples.



3.2 Heuristics for UFA and DFA Inference 

Heuristic: For UFA inference, we use a hill-climbing heuristic, i.e. we choose the
unambiguous merging leading to the smallest automaton (which is equivalent to
the unambiguous merging entailing the most state-mergings). For DFA inference
we used the EDSM heuristic (for Evidence Driven State-Merging). This heuristic
was proposed by [LPP98] and won the grammatical inference competition
Abbadingo [Abb98]. EDSM chooses the deterministic merging that entails the
largest number of merges between final states by merging for determinisation.

In practice, these two heuristics require computing every possible unambiguous
merging (for UFA) or deterministic merging (for DFA) of two states. A score
is given to each state pair (the number of merged states for UFA, and the number
of merged final states for DFA), and the state pair with the best score is
chosen.

Even if a priori different, these two heuristics may be seen as closely related 
to each other with respect to the notion of acceptance. 

Indeed, the choice of counting merges between final states instead of merges
between all states in EDSM can be seen as a measure of the “size” of the
intersection of the suffix languages of the two scored states. The prefix languages
of the states of a DFA being disjoint, this measure can also be seen as a measure of
the number of acceptances being unified by the state-merging.

This idea is also present in the hill-climbing heuristic for UFAs. Each
state-merging computed by the merging for disambiguisation procedure is due to the
existence of two acceptances for a word. These state-mergings therefore unify
acceptances, and the number of merged states can be considered as
a measure of the number of unified acceptances.

Use of counter-examples: Counter-examples may be used in different ways.
We can consider biased inference, which consists in generalizing examples and
stopping the generalization with the counter-examples [DMV94]. We can also
consider unbiased inference [AS95,Cos99], which considers that $S_+$ and
$S_-$ are examples respectively of the target language $L$ and of
$L_- = \Sigma^* \setminus L$. In this context, the two languages $L$ and $L_-$ are
inferred by generalizing simultaneously $S_+$ and $S_-$ using a classifier
automaton. Generalization is stopped with the constraint
$L \cap L_- = \emptyset$ (instead of $L \cap S_- = \emptyset$).

The DeLeTe2 algorithm works in the biased inference paradigm. The EDSM
heuristic has been presented in [LPP98] in the unbiased inference paradigm but
can also be applied to biased inference (as presented in the previous paragraph).
In this paper we compare the use of the EDSM heuristic for DFA inference both
in the unbiased and biased paradigm, hill-climbing for UFA inference both in the
unbiased and biased paradigm, and DeLeTe2. The corresponding algorithms will be
denoted respectively $D_{edsm}$, $D_{bedsm}$, $U_{hc}$, $U_{bhc}$ and DLT2. We will
also consider the majority vote, denoted by MAJ.
Fig. 5. Average recognition level on test sets.



3.3 Inference Results 

The evaluation consists in scoring each algorithm on each benchmark by
its average recognition level on the test sets (figure 5). Like [DLT01], we also
compare algorithms through matches (noted algo1-algo2 in figure 6). A match
consists in counting the number of times an algorithm is better than another
(in terms of recognition rate); we count a tie when the difference is not
significant (using the McNemar test [Die98]). Those matches are noted as
tuples: wonByAlgo1, nbTie, wonByAlgo2. Since experiments have been made on
different machines with different implementations, comparing running times is
difficult. Nevertheless, in these experiments, our algorithm $U_{bhc}$ seems to
be 2 orders of magnitude slower than DLT2, which is slower than $D_{bedsm}$. The
symbol * in a cell indicates that some experiments did not finish within
the time limit of 100 h on a 750 MHz CPU (more precisely, two of the 480 runs
of the algorithm $U_{hc}$ did not finish on time).
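The match counting described above can be sketched as follows. The sketch uses the usual continuity-corrected McNemar statistic and assumes a 5 % significance level (critical value 3.841 for a chi-square distribution with one degree of freedom); the significance level actually used in the experiments is not stated here, and the function names are hypothetical.

```python
CHI2_1DOF_5PCT = 3.841  # assumed significance level of 5 %


def mcnemar_statistic(truth, pred1, pred2):
    """Continuity-corrected McNemar statistic for two classifiers on one test set."""
    # only1_wrong: examples misclassified by algorithm 1 only;
    # only2_wrong: examples misclassified by algorithm 2 only.
    only1_wrong = sum(1 for t, p1, p2 in zip(truth, pred1, pred2)
                      if p1 != t and p2 == t)
    only2_wrong = sum(1 for t, p1, p2 in zip(truth, pred1, pred2)
                      if p1 == t and p2 != t)
    if only1_wrong + only2_wrong == 0:
        return 0.0
    return (abs(only1_wrong - only2_wrong) - 1) ** 2 / (only1_wrong + only2_wrong)


def match(truth, pred1, pred2):
    """Return 'algo1', 'algo2' or 'tie' for one test set."""
    if mcnemar_statistic(truth, pred1, pred2) <= CHI2_1DOF_5PCT:
        return "tie"
    correct1 = sum(1 for t, p in zip(truth, pred1) if p == t)
    correct2 = sum(1 for t, p in zip(truth, pred2) if p == t)
    return "algo1" if correct1 > correct2 else "algo2"


def tally(test_sets):
    """Tuple (wonByAlgo1, nbTie, wonByAlgo2) over (truth, pred1, pred2) triples."""
    outcomes = [match(*triple) for triple in test_sets]
    return (outcomes.count("algo1"), outcomes.count("tie"), outcomes.count("algo2"))


truth = [1] * 20
clear_win = (truth, [1] * 20, [0] * 10 + [1] * 10)   # statistic (10 - 1)^2 / 10 = 8.1
small_gap = (truth, [1] * 20, [0] * 2 + [1] * 18)    # statistic (2 - 1)^2 / 2 = 0.5
print(tally([clear_win, small_gap]))
```

A difference of two errors on twenty words is not significant and is counted as a tie, whereas a difference of ten errors counts as a win.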

We remark that algorithms based on UFA inference have better recognition
scores on the benchmark than the original DFA algorithm ($U_{bhc}$ won 170 times
against 119 for $D_{bedsm}$). This result was expected for the NFA, regular
expression and UFA benchmarks. More surprisingly, UFA inference performs better
than DFA inference on the DFA benchmark. We explain this by considering that
choosing the wrong unambiguous merging at a step of the algorithm imposes fewer
constraints on future mergings than choosing a wrong deterministic merging.
A wrong unambiguous merging can therefore be “partly corrected” by future
mergings.

We can also remark that the biased versions of the algorithms are much
better than the unbiased ones on benchmarks for which $L$ and
$\Sigma^* \setminus L$ are not generated symmetrically ($U_{bhc}$ won 135 times
against 63 for $U_{hc}$, and $D_{bedsm}$ won 168 times against 56 for $D_{edsm}$
on these benchmarks).
Fig. 6. Matches between algorithms.



When comparing UFA and RFSA inference, the tables of figures 5 and 6 show
that UFA inference and RFSA inference are each better suited to different
subparts of the benchmark: $U_{bhc}$ performs better than DLT2 on UFA and DFA based
benchmarks. However, DLT2 is the best algorithm on benchmarks based on NFA
and regular expressions. Thus, we suppose that the class of UFA is “closer” to
DFA than the class of RFSA is “close” to NFA and regular expressions.

Conclusion 

We have revisited the search space for automata inference. We formalized prop- 
erties known on the DFA search space, and extended them to the UFA search 
space. This work led to the extension of the well known deterministic merging
operator to the unambiguous merging operator, which seems very promising for 
automata inference. Indeed, this new operator allows us to propose a heuristic 
closely related to EDSM [LPP98]. The use of the unambiguous merging operator 
together with this heuristic has been shown to perform well on benchmarks of 
the domain. 

Deeper studies on the deterministic merging operator and on the unambigu- 
ous merging operator have shown that these operators give a lattice structure to 
the search space [CF03]. Therefore, practical results presented in this paper use 
only part of the available theoretical properties. Integrating these properties in 
inference algorithms is an open perspective to our research. 
Abstract. Due to the unavoidable fact that a robot’s sensors will be
limited in some manner, it is entirely possible that it can find itself unable 
to distinguish between differing states of the world (the world is in effect 
partially observable). If reinforcement learning is used to train the robot, 
then this confounding of states can have a serious effect on its ability 
to learn optimal and stable policies. Good results have been achieved 
by enhancing reinforcement learning algorithms through the addition of 
memory or the use of internal models. In our work we take a different 
approach and consider whether active perception could be used instead. 
We test this using omniscient oracles, who play the role of a robot’s 
active perceptual system, in a simple grid world navigation problem. 
Our results indicate that simple reinforcement learning algorithms can 
learn when to consult these oracles, and as a result learn optimal policies. 



1 Introduction 

Partially observable environments cause particular problems when reinforcement 
learning is used to learn the task. Reinforcement learning algorithms associate 
rewards and actions with states, but in partially observable environments the 
true state can be masked. This makes it virtually impossible (apart from some 
very simple worlds [1]) for basic 1-step reinforcement learning algorithms to 
converge to optimal policies [2, pp73-78]. Work in this area has shown that 
when learning algorithms are augmented with memory [3] or the ability to learn 
a model of their world [4,5], they can find optimal solutions to this type of 
problem. Our aim is to examine an alternative approach. Rather than equipping 
the learning algorithms with memory or the ability to model their environment, 
we propose to equip agents or robots with sensors which they can actively control, 
and hope to demonstrate that the learning algorithms can learn to use this active 
perception to find optimal solutions. 

This paper considers the questions: (i) could an agent learn to use an active 
perceptual system?, (ii) is there any benefit for an agent in using such a system 
to disambiguate its current state? To study the fundamentals of the problem we 
consider simulated agents moving around deterministic grid worlds, for example 
Sutton’s Grid World (Fig. 1). In order to focus on the questions raised above we
make the assumption that it is possible to give an agent an active perception 
system which it could use to disambiguate the current state. To this end we 
introduce oracles which the agent can consult. These play the role of active 
perception systems which can disambiguate the agent’s current state perfectly, 
see section 2.4. Thus the hypotheses that we attempt to test are: 

(i) Reinforcement learning algorithms can learn when to use additional resources 
in order to determine an agent’s true state in a partially observable grid 
world. 

(ii) Resolving the agent’s true state in a partially observable world allows rein- 
forcement learning algorithms to learn optimal policies. 

2 Background 

2.1 Reinforcement Learning & Partially Observable Worlds

Whitehead [2, p72] identifies two distinct problems which occur when using re- 
inforcement learning with robotic systems in partially observable environments: 
local and global impairment. 

As an example of local impairment consider a robot standing at one of two 
similar looking T-junctions. It is unable to distinguish between the two junctions 
and hence regards them as the same state. We use the phrase aliased state to 
refer to such states which appear to be identical but in the underlying model are 
distinct. The robot’s policy when learnt using reinforcement learning can only 
link a single action to the perceived single state, thus it will execute the same 
action at both of the T-junctions. If at one T-junction the optimal action is to 
turn left and at the other it is to turn right then the single action learnt by the 
policy will be wrong in one of the two cases^. More generally, when learning a 
state-action policy where there exist aliased states, the policy will sometimes 
select actions which are inconsistent with the actual underlying state. 

¹ Assuming the only possible actions are turn left or right. 

Global impairment occurs where inconsistent state values produced by aliased 
states are used to update the state values of otherwise consistent states. This 
occurs independently of whether the action selected in the aliased states is 
inconsistent. As an example, consider again the robot standing at one of the two 
T-junctions. If the optimal action at both T-junctions is the same, e.g. turn left, 
then action selection is not a problem, as the policy can link this single action 
with the single perceived state. However, there is still a problem with representing 
the value of the aliased states. In reinforcement learning the policy is generally 
represented by storing a value for each state or each state-action pair. These 
values indicate the utility of being in that state (or selecting a given action from 
that state). Given that the two T-junctions are perceived as one and the same 
location, a single state value or single set of state-action values will be stored. 
The stored value(s) will be updated to represent the value of being at both of the 
T-junctions and hence over time become an average of the true values of the 
underlying states, weighted by the number of times each state is visited. Now, if 
one junction is far from the goal and the other is very close to the goal, their 
averaged value will make the junction that is far from the goal appear more 
attractive than it should, while the junction near the goal will appear to be a 
less valuable state than it should. If 1-step backup of state values is employed, 
such as in SARSA or Q-learning [6, p146, p149], then these averaged state values 
will be used to update those states around the T-junctions, causing errors in 
their state values and possibly also in the selection of actions. These errors in 
the state values can propagate from state to state, affecting the whole of the 
robot’s policy. 
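
The averaging effect described above can be illustrated with a small numerical sketch. The true values and visit counts below are invented purely for illustration and are not taken from the experiments; the function name is likewise hypothetical.

```python
# Illustrative only: the values and visit counts are invented, not from the paper.
def aliased_value(true_values, visit_counts):
    """Visit-weighted average of the true values of states sharing one percept."""
    total_visits = sum(visit_counts)
    return sum(v * n for v, n in zip(true_values, visit_counts)) / total_visits


# A junction near the goal (true value -2) and one far from it (true value -12),
# where the far junction is visited three times as often as the near one.
shared = aliased_value([-2.0, -12.0], [1, 3])
print(shared)  # -9.5
```

The single stored value of -9.5 understates the junction near the goal and overstates the one far from it, which is the distortion that 1-step backups then pass on to neighbouring states.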



2.2 Satisficing & Optimal Memoryless Policies 

Littman [7] considered learning state-action policies in partially observable 
environments and introduced the useful concepts of satisficing and optimal 
memoryless policies. A policy is said to be satisficing if, independent of its 
initial state, an agent following this policy is guaranteed to reach the goal. A 
memoryless policy is a policy which selects an action based solely on the current 
sensation. SARSA and Q-learning are examples of reinforcement learning algorithms 
which work on exactly this basis. If the performance of a policy is measured by 
the number of steps taken to reach the goal summed over all possible initial 
states of the agent (total steps), then an optimal policy is one that achieves 
the minimum total steps to the goal, and an optimal memoryless policy is one that 
achieves the minimum total steps attainable by any memoryless policy. 

Singh et al. [8] proved that an optimal memoryless policy can be arbitrarily worse 
than the optimal policy which can be achieved in the absence of aliased states. 
Despite this result, Littman [7] used branch and bound techniques to demon- 
strate that it is possible to find optimal memoryless solutions for various grid 
world navigation problems, and for the range of grid worlds examined, the opti- 
mal memoryless policies did not take an unreasonable number of extra steps. 



2.3 Eligibility Traces 

Whitehead [2] indicates that global impairment prevents 1-step backup reinforce- 
ment learning algorithms from being able to learn stable and optimal policies 
for partially observable environments. However, Loch and Singh [9] showed that 
the simple addition of eligibility traces to reinforcement learning allows it to find 
optimal memoryless policies in grid world navigation problems. Eligibility traces 
appear to avoid the problem posed by global impairment by updating a chain 
of previously visited states. 
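
To make the "chain of previously visited states" concrete, the following is a minimal sketch of one tabular SARSA(λ) update with replacing traces. It is a simplified textbook-style variant written for illustration, not the authors' implementation; the function name, the dictionary-based tables and the state labels are assumptions.

```python
from collections import defaultdict


def sarsa_lambda_update(q, traces, s, a, reward, s_next, a_next,
                        alpha=0.1, gamma=0.9, lam=0.9):
    """One tabular SARSA(lambda) step with replacing traces (simple variant)."""
    td_error = reward + gamma * q[(s_next, a_next)] - q[(s, a)]
    traces[(s, a)] = 1.0  # replacing trace: reset to 1 instead of incrementing
    for key in list(traces):
        # Every recently visited state-action pair shares in the correction.
        q[key] += alpha * td_error * traces[key]
        traces[key] *= gamma * lam


q = defaultdict(float)  # state-action values, initialised at zero
traces = {}
sarsa_lambda_update(q, traces, "s0", "east", -1, "s1", "east")
sarsa_lambda_update(q, traces, "s1", "east", -1, "s2", "east")
print(round(q[("s0", "east")], 3))  # -0.181
```

After the second step the value for ("s0", "east") has been adjusted twice, once directly and once through its decayed trace, whereas a 1-step backup would have left it at -0.1.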



2.4 Active Perception 

By active perception we mean a perceptual system which an agent can direct 
in order to vary the input it receives from the world. An obvious example is 
a video camera mounted on a robotic arm. An agent equipped with such a 
camera can direct the movement of the arm and thus obtain different views of 
its surroundings. In the context of grid worlds, active perception would involve 
the agent being able to choose which grid squares make up its input state. At this 
stage, however, we are interested in simplifying the problem in order to understand 
the underlying dynamics. Therefore, rather than equip the grid world agents with 
active perception systems which they would have to learn to coordinate, we have 
instead provided them with oracles. 

Introducing oracles which the agents can consult is a useful test. If an agent 
can learn when to make appropriate use of an oracle, then it should be able to 
learn when to use an active perceptual system. On the other hand, if it fails 
to make use of an oracle, it seems unlikely that it could learn to use an active 
perception system, especially as the latter may require the coordination of many 
more actions. In the experiments presented below we have used two types of 
oracle. These correspond to two possible questions “where am I?” or “where 
should I go?”. The first, the State Oracle, can on request provide the agent with 
an unambiguous state representation of its current location. The second, the 
Action Oracle, explicitly directs the agent based on a known optimal solution 
(the solution used is that indicated by the arrows in Fig. 1). 

Our hypothesis, that agents using active perception (or in this case oracles) 
together with eligibility traces should be able to learn optimal solutions, stems 
from the observation that oracles resolve the problem of local action selection, 
and eligibility traces address the issue of global impairment. 

We do not currently envisage using oracles when implementing this technique 
on real robots. They are introduced as a useful simplification to aid understand- 
ing of the underlying problems. However, it is possible to imagine scenarios 
where resource constraints might make the use of oracles worthwhile, e.g. mass 
produced military robots with limited computational power and sensors, which 
have access to a central computer to aid their exploration. To avoid overloading 
the central computer or to minimise the number of transmissions they make, 
these robots might only call on the central computer (their oracle) when unable 
to determine their location independently. 

3 Experiments 

The experiments presented here use Sutton’s Grid World (Fig. 1), which consists 
of a 9 × 6 grid containing various obstacles and a goal in the top right-hand 
corner (indicated by an asterisk). An agent in this world can choose between 
four physical actions: move north, south, east and west. The agent receives a 
reward of -1 for each action which does not move it directly to the goal state 
and a reward of 0 for moving directly to the goal state. State transitions are 
deterministic and each action moves the agent one square in the appropriate 
direction. If an agent selects a physical action whose execution is obstructed by 
an obstacle or wall, then the agent’s location remains unaltered but it receives 
the same -1 reward as above. When the agent reaches the goal state it is relocated 
to a uniformly random start state. 
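
The dynamics just described can be sketched as a small transition function. The obstacle coordinates below are an assumption based on the commonly reproduced layout of Sutton's maze (seven obstacle squares, which leaves the 46 non-goal squares mentioned later in the text); the coordinate convention and all names are illustrative. Relocation to a random start state after reaching the goal is left to the caller.

```python
WIDTH, HEIGHT = 9, 6
GOAL = (8, 0)  # (column, row); row 0 is the top row, so this is the top right corner
# Assumed obstacle squares (7 in total); the paper's Fig. 1 is the authority.
OBSTACLES = {(2, 1), (2, 2), (2, 3), (5, 4), (7, 0), (7, 1), (7, 2)}
MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


def is_free(cell):
    """True if the cell lies inside the grid and is not an obstacle."""
    x, y = cell
    return 0 <= x < WIDTH and 0 <= y < HEIGHT and cell not in OBSTACLES


def step(cell, action):
    """Return (next_cell, reward) for a deterministic physical action."""
    dx, dy = MOVES[action]
    target = (cell[0] + dx, cell[1] + dy)
    if not is_free(target):
        return cell, -1  # blocked by a wall or obstacle: stay put, same -1 reward
    return target, (0 if target == GOAL else -1)


print(step((8, 1), "north"))  # ((8, 0), 0)
print(step((0, 0), "west"))   # ((0, 0), -1)
```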

Fig. 1. Sutton’s grid world. Values indicate the observations obtained by an agent 
observing the eight squares surrounding its current location (Eight Adjacent Squares 
Agent). Arrows show an example optimal policy 




Fig. 2. Eight Adjacent Squares Agent has a state representation formed by observing 
the eight squares surrounding its current location 



Littman [7] modified Sutton’s original problem by introducing an agent whose 
state representation is formed by observing the eight squares adjacent to its 
current location (Fig. 2), an Eight Adjacent Squares Agent. This contrasts with an 
agent whose state representation is its location in the world given in Cartesian 
coordinates, and it is more representative of the problem faced by a robot in a 
building. For an Eight Adjacent Squares Agent there are multiple locations that 
are aliased; Fig. 1 shows indicative values which represent the perception of 
such an agent. Some of the aliased states cause the agent to learn inconsistent 
actions, e.g. perception 148 occurs in three locations: (i) four squares directly 
below the goal, (ii) near the middle of the obstacle which is to the left of the 
goal and (iii) near the middle of the far obstacle. As can be seen from Fig. 1 the 
optimal solution (indicated by the arrows) requires a different action in one of 
the three occurrences of this perceptual state. Other aliased states cause only 
global impairment, e.g. perception 2 which occurs just below the obstacle near 
the goal and just below the far obstacle. In this case the optimal action is always 
the same, i.e. move east, but the aliased state values cause problems as one 
location is very near the goal and the other is quite a distance away. 
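
One way such an observation could be computed is to pack the occupancy of the eight neighbouring squares into a single integer. The neighbour ordering below is an assumption (the numbering scheme behind the values in Fig. 1 is not stated in the text), so the codes produced here need not match the figure; the sketch only shows how distinct locations come to share one percept.

```python
# Assumed neighbour order (clockwise from north); the paper's ordering is not given.
NEIGHBOUR_OFFSETS = [(0, -1), (1, -1), (1, 0), (1, 1),
                     (0, 1), (-1, 1), (-1, 0), (-1, -1)]


def observe(cell, blocked):
    """Pack the occupancy of the eight surrounding squares into one integer."""
    code = 0
    for bit, (dx, dy) in enumerate(NEIGHBOUR_OFFSETS):
        if blocked((cell[0] + dx, cell[1] + dy)):
            code |= 1 << bit
    return code


def outside_open_room(cell):
    """Hypothetical 4 x 4 empty room: everything outside it counts as wall."""
    return not (0 <= cell[0] < 4 and 0 <= cell[1] < 4)


print(observe((0, 0), outside_open_room))  # 227
print(observe((1, 1), outside_open_room) == observe((2, 2), outside_open_room))  # True
```

The last line shows two different interior squares producing the same observation, i.e. an aliased state in the sense used above.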

We used three types of agent: 

(i) Eight Adjacent Squares Agent. An agent whose state representation is 
formed by observing the 8 squares adjacent to its current location, as shown 
in Fig. 2. 

(ii) Action Oracle Agent. An agent who has the same state representation as 
the Eight Adjacent Squares Agent but has an additional action allowing it 
to ask the Action Oracle in which direction it should go. The agent then 
immediately executes the action specified by the oracle. 

(iii) State Oracle Agent. An agent whose normal state representation is the same 
as the Eight Adjacent Squares Agent, but who can ask the State Oracle 
where it is. On asking the oracle the agent receives a state representation 
that corresponds to its absolute location in the world given in Cartesian 
coordinates. 

The latter two agents receive a reward of -1 for selecting the action to ask 
their respective oracles a question. This reward is in addition to the reward for 
executing any subsequent actions. The way this works in practice is that the 
Action Oracle Agent, on asking its oracle where to go, transitions to an adjacent 
location in the grid world based on the optimal action from its current location, 
and receives a reward of -2 (-1 for asking the oracle and -1 for the action 
executed). The State Oracle Agent, on asking its oracle where it is, sees a state 
transition from its normal representation of its current location to an absolute 
coordinate system representation of the same location, receiving a penalty of 
-1 for asking the oracle. It then has to learn the optimal action to execute 
from this new state representation. Once it selects a physical action, its state 
representation reverts back to normal, i.e. it is formed by observing the 8 squares 
adjacent to its new location. This dual representation of the State Oracle Agent’s 
location in the world more than doubles the state space which it needs to explore. 
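
The State Oracle Agent's dual representation can be sketched as a thin wrapper around the world's dynamics. This is an illustration under stated assumptions rather than the authors' code: the action name `ASK_ORACLE`, the tagged-tuple state format and the `step`/`observe` callables supplied by the caller are all hypothetical.

```python
ASK_ORACLE = "ask_oracle"  # hypothetical name for the extra perceptual action


def state_oracle_transition(cell, action, step, observe):
    """Return (next_cell, next_representation, reward) for a State Oracle Agent.

    `step(cell, action)` gives (next_cell, reward) for physical actions and
    `observe(cell)` gives the local eight-square observation.
    """
    if action == ASK_ORACLE:
        # Location is unchanged; the agent now sees absolute coordinates.
        return cell, ("absolute", cell), -1
    next_cell, reward = step(cell, action)
    # Any physical action reverts the representation to the local observation.
    return next_cell, ("local", observe(next_cell)), reward


def demo_step(cell, action):
    """Stub dynamics for the demo: every physical action moves one square east."""
    return (cell[0] + 1, cell[1]), -1


def demo_observe(cell):
    """Stub observation for the demo: open space everywhere."""
    return 0


print(state_oracle_transition((3, 2), ASK_ORACLE, demo_step, demo_observe))
# ((3, 2), ('absolute', (3, 2)), -1)
print(state_oracle_transition((3, 2), "east", demo_step, demo_observe))
# ((4, 2), ('local', 0), -1)
```

Because the learner indexes its value table by the returned representation, each location can appear under both a "local" and an "absolute" key, which is the source of the enlarged state space noted above.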

A range of reinforcement learning algorithms was used: SARSA, Q-learning, 
SARSA(λ) with replacement traces, and Watkins’s Q(λ) with accumulating traces. 
For details of the learning algorithms see [6, p146, p149, p181, p184] respectively. 
All of the learning algorithms continuously updated their policies. Actions were 
selected greedily using the current policy with a probability of (1 - ε). In cases 
where actions had the same value, ties were broken at random. With the remaining 
probability ε the action executed was selected randomly from all the available 
actions. In all cases above, random selection was uniform across all possibilities. 
State-action values for all the learning algorithms were initialised at zero. 

The following values were used for all learning algorithms: (i) learning rates (α) 
of 0.1 and 0.01, (ii) discount rate γ = 0.9, (iii) probability of random action 
(ε) started at 20% and decayed linearly, reaching zero at the 200,000th 
action-learning step; thereafter it remained at zero. For the learning algorithms 
SARSA(λ) and Watkins’s Q(λ) a range of values was tried for the eligibility 
trace decay rate (λ), from 0.01 to 0.9. Results are shown for λ = 0.9, which was 
in general found to perform the best. 
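
The exploration schedule and action selection rule described above might be sketched as follows. The function names are hypothetical, and treating the first action-learning step as index 0 is an assumption about how the linear decay is anchored.

```python
import random


def epsilon_at(step_index, start=0.2, zero_at=200_000):
    """Linearly decayed exploration rate: `start` at step 0, zero from `zero_at` on."""
    return max(0.0, start * (1.0 - step_index / zero_at))


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a dict {action: value} with random tie-breaking."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform over all available actions
    best = max(q_values.values())
    # Exploit: uniform choice among all actions that share the best value.
    return rng.choice([a for a in actions if q_values[a] == best])


print(epsilon_at(0), epsilon_at(100_000), epsilon_at(300_000))  # 0.2 0.1 0.0
print(select_action({"north": -1.0, "east": 0.0}, epsilon=0.0))  # east
```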

4 Evaluation 

To test the first hypothesis we ran the agents for 400,000 action-learning steps 
and looked at example policies learnt by individual agents to examine if they 
were making appropriate use of the oracles. 






Paul A. Crook and Gillian Hayes 




Fig. 3. Example policies learnt after 400,000 action-learning steps. Arrows indicate the 
physical action selected by the policy. Squares containing filled blocks indicate where 
the policy is to consult the oracle in order to select the shown physical action 

To evaluate the second hypothesis we again ran the agents for 400,000 
action-learning steps but evaluated the current policy after every 1000 
action-learning steps. The policy was evaluated by executing it greedily and 
determining the total steps required to reach the goal from every possible 
starting position. Separate counts were kept of physical actions and requests to 
the oracle. The agent was limited to a maximum of 1000 steps (total of both 
physical actions and requests to oracles) to reach the goal from each starting 
position. There are 46 non-goal starting states in Sutton’s Grid World, so a 
policy that fails to reach the goal from every one of them would score the 
maximum total of 46,000 steps. Each combination of agent, world, learning 
algorithm and values of α and λ was repeated 100 times, giving 100 samples per 
data point. 
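
The evaluation measure can be sketched as a capped roll-out from every start state. For brevity the demo below uses a hypothetical five-square corridor with a fully observable policy rather than Sutton's grid world, and does not separate physical actions from oracle requests; the names are illustrative.

```python
MAX_STEPS = 1000  # per-start cap, counting physical actions and oracle requests


def total_steps(policy, step, start_states, goal):
    """Sum of steps needed to reach `goal` over all start states, capped per start."""
    total = 0
    for state in start_states:
        steps = 0
        while state != goal and steps < MAX_STEPS:
            state = step(state, policy(state))
            steps += 1
        total += steps
    return total


def corridor_step(state, action):
    """Hypothetical corridor of squares 0..4 with the goal at square 4."""
    return min(state + 1, 4) if action == "right" else max(state - 1, 0)


print(total_steps(lambda s: "right", corridor_step, range(4), goal=4))  # 10
print(total_steps(lambda s: "left", corridor_step, range(4), goal=4))   # 4000
```

The second policy never reaches the goal, so each of its four starts is truncated at the cap; the same reasoning gives the 46 × 1000 = 46,000 ceiling quoted above.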

5 Results 

Fig. 3 shows examples of the policies learnt for both Action and State Oracle 
Agents, using SARSA, Q-learning and SARSA(λ). Grid squares containing filled 
blocks indicate where the policy is to consult the oracle. Of the examples shown 
three have learnt that they achieve a better solution if they consult their re- 
spective oracle when they are in the three squares labelled 148 in Fig. 1. As 
was discussed in section 3, the squares labelled 148 are perceived to be identical 
but the action required for an optimal solution is not the same in each location. 
In the fourth example in Fig. 3 the agent consults the State Oracle when in 
the squares labelled 0, 2 and 16 in Fig. 1. The grid squares labelled 0 require 
different actions to be executed in different locations. The squares labelled 2 are 
aliased but require the same action to be executed in both locations. Square 16 
is not aliased. Of the examples shown, the top right (State Oracle, SARSA(λ) 
with λ = 0.9 and α = 0.01) is an optimal solution in terms of physical actions. 

Fig. 4. Plot of mean total steps (sum of physical actions and requests to oracles) found 
when policies were evaluated, versus action-learning steps. To simplify plots, data points 
are only shown for every 50,000 action-learning steps. Bars indicate 95% confidence 
intervals. The two inserts show enlargements of the tail of their respective plots 



Fig. 4 allows comparison of the mean total steps (sum of physical actions 
and requests to the oracles) of the three agents. Plots are shown for each of the 
learning algorithms with the exception of Watkins’s Q(λ). Plots for Watkins’s 
Q(λ), which are not shown due to space constraints, are very similar to those for 
SARSA(λ). The top three plots show results for α = 0.1, the lower three plots 
show results for α = 0.01. 

For α = 0.01 the two 1-step backup algorithms (SARSA and Q-learning) 
fail to learn reasonable policies using any of the agents. The mean total steps 
in these six cases remains just below the maximum of 46,000 steps. Results for 
SARSA and Q-learning are better when α = 0.1, with the mean total steps 
for all three agents falling as learning progresses. Even so, the Eight Adjacent 
Squares Agent struggles to learn reasonable solutions. This is expected, as 1-step 
backup of state values will cause global impairment of the learnt policies. 
The curve for the Action Oracle Agent is better in the case of SARSA but 
almost identical to that of the Eight Adjacent Squares Agent for Q-learning. The 
most significant difference is shown by the State Oracle Agent. It is slower to learn 
initially, due to the increase in the state space caused by the State Oracle (see 
section 3), but in the longer term it achieves much better results than either of 
the other two agents. In the two plots for SARSA(λ), the mean total steps of all 
three agents reduce rapidly, convergence occurring more quickly with α = 0.1 
than α = 0.01, as would be expected. Inserts, which show the tail of both of 
these plots, indicate that for α = 0.1 there is no distinction between the mean 
total steps for the three types of agent. However, for α = 0.01 the average policy 
learnt by the Eight Adjacent Squares Agent is worse than that for the two agents 
which can use oracles. 

To get a clearer picture of what is occurring we classify the policies that 
have been learnt into five categories: optimal, better than memoryless optimal, 
memoryless optimal, other satisficing and non-satisficing. We define these terms 
specifically for Sutton’s grid world in terms of the total physical actions taken 
when the policy is evaluated. The optimal policy for Sutton’s grid world takes a 
total of 404 physical actions. Littman [7] showed that the optimal memoryless 
solution for Sutton’s grid world takes 416 physical actions. Note that these terms 
are defined in terms of physical actions and exclude requests to oracles. This is 
in order to make a level comparison between agents with oracles and those without. 
Littman’s [7] definition of a satisficing policy is one that reaches the goal from 
all possible start states. Our measure of satisficing is stricter than this 
requirement, as the agent is limited to 1,000 actions from each start state, after 
which the run is truncated and the policy deemed to be non-satisficing. Note that 
this measure includes requests to oracles, as it is possible for an agent to 
select no physical actions and spend all its time talking to the oracle. The other 
satisficing class contains those policies that are satisficing but exceed the 
number of physical action steps required by the other categories. The five 
categories are summarised in Table 1. 

Fig. 5 shows the categorisation of policies from 100 trials over the course of 
400,000 action-learning steps. Top left shows the policies learnt by the State 
Oracle Agent using Q-learning with α = 0.1. By the end of the trials the majority of 
policies are classified as other satisficing (72 policies), with 4 that are memoryless 
optimal, 20 that are better than memoryless optimal and 4 non-satisficing. No 
optimal policies have been learnt. These results are better than expected, as the 
State Oracle only addresses the issue of local action selection, not that of global 
impairment. The success of this agent with 1-step backup learning algorithms is, 
however, heavily dependent on the value of α, as no satisficing policies are learnt 
for α = 0.01. The categorisation of policies learnt by the State Oracle Agent 
using SARSA (not shown) is very similar, with final results of 79 other satisficing, 6 
memoryless optimal, 14 better than memoryless optimal, and 1 non-satisficing. 

Fig. 5 also shows the categorisation of policies for SARSA(λ) for each of 
the three agents. Separate plots are shown for α = 0.1 and 0.01 for the Action 
Oracle and State Oracle Agents. For the Eight Adjacent Squares Agent, which 
has no access to an oracle, with α = 0.1 the majority of solutions are memoryless 
optimal, 83 at the end of the 100 trials, with 4 other satisficing policies and 13 
non-satisficing. The results are similar for this agent with α = 0.01 (not shown): 
68 memoryless optimal, 11 other satisficing and 21 non-satisficing. This is in line 
with our expectations, as Loch and Singh [9] showed that the use of eligibility 
traces allowed agents to find optimal memoryless solutions. 

With α = 0.1 the Action Oracle Agent initially generates a small number of 
optimal policies; however, these quickly disappear, indicating that they are not 
stable, and by the end the majority of policies (65) are memoryless optimal, 19 
are better than memoryless optimal, 1 is other satisficing, and 15 are 
non-satisficing. The State Oracle Agent learns a large number (73) of better than 
memoryless optimal policies, 7 memoryless optimal, 4 optimal, and 1 other 
satisficing policy, the remaining 15 being non-satisficing. With α = 0.01 both 
oracle agents learn significantly more optimal policies. The Action Oracle Agent 
learns 41 optimal policies, 10 better than memoryless optimal, 39 memoryless 
optimal, 3 other satisficing, and 7 non-satisficing. The State Oracle Agent learns 
31 optimal policies, 33 better than memoryless optimal, 30 memoryless optimal, 5 
other satisficing, and 1 non-satisficing. 

Fig. 5. Plots show categorisation of policies versus action-learning steps. Height of 
shaded areas indicates the number of policies, out of a total of one hundred, that 
fall into each classification 

Table 1. Policy categories for Sutton’s Grid World 

Goal Reached From    Total Physical    Policy Category 
All Start States     Actions 

yes                  404               Optimal 
yes                  405 - 415         Better Than Memoryless Optimal (BTMO) 
yes                  416               Memoryless Optimal 
yes                  > 416             Other Satisficing 
no                   -                 Non-Satisficing 
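
The categories of Table 1 translate directly into a small classification function. The thresholds 404 and 416 are those given in the text for Sutton's grid world; the function and argument names are illustrative.

```python
OPTIMAL_TOTAL = 404              # total physical actions of an optimal policy
MEMORYLESS_OPTIMAL_TOTAL = 416   # total for the optimal memoryless policy [7]


def categorise(reaches_goal_from_all_starts, total_physical_actions):
    """Map an evaluated policy onto the five categories of Table 1."""
    if not reaches_goal_from_all_starts:
        return "Non-Satisficing"
    if total_physical_actions < OPTIMAL_TOTAL:
        raise ValueError("total is below the optimum for this grid world")
    if total_physical_actions == OPTIMAL_TOTAL:
        return "Optimal"
    if total_physical_actions < MEMORYLESS_OPTIMAL_TOTAL:
        return "Better Than Memoryless Optimal"
    if total_physical_actions == MEMORYLESS_OPTIMAL_TOTAL:
        return "Memoryless Optimal"
    return "Other Satisficing"


print(categorise(True, 410))   # Better Than Memoryless Optimal
print(categorise(False, 0))    # Non-Satisficing
```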

Overall, the plots for SARSA(λ) indicate that although the mean total 
steps for the three types of agent are close (Fig. 4), the actual policies that 
each learns vary significantly. Unlike the Eight Adjacent Squares Agent, both 
of the oracle agents learn optimal and better than memoryless optimal policies, 
especially when a lower value of α is used. 

6 Discussion & Conclusion 

The locations where the oracles are consulted in Fig. 3 generally correspond to 
places where we would expect difficulties due to state aliasing. Thus it appears 
that our first hypothesis is confirmed, in that reinforcement learning algorithms 
can learn to make good use of external resources in order to clarify the agent’s 
current state. This indicates that agents should be able to make use of active 
perception systems when solving partially observable navigation problems. 

Our second hypothesis is also not falsified, as either oracle agent when 
combined with SARSA(λ) or Watkins’s Q(λ) can learn optimal policies. Plots for 
Watkins’s Q(λ) are not shown; they are similar to those for SARSA(λ), although 
significantly fewer optimal policies are learnt. 

The State Oracle Agent generates more optimal policies than the Action 
Oracle Agent for α = 0.1. For α = 0.01 it has a lower mean total steps than the 
Action Oracle Agent whilst almost matching the number of optimal solutions 
learnt by the Action Oracle Agent. The success of the State Oracle Agent is 
encouraging since compared to the Action Oracle, which has access to a known 
optimal policy, the State Oracle has no extra information about the task. The 
results indicate that in order to aid an agent dealing with a partially observable 
task, an active perception system only has to provide non-ambiguous represen- 
tations (within the context of the task) for the current state, i.e. it does not 
have to provide any additional knowledge or reasoning about the problem. 

The frequency with which optimal policies are learnt appears to be limited 
and dependent on the value of α. Possible causes of this limit could be: (i) the 
cost of perceptual actions is too high; (ii) that the advice given by the Action 
Oracle to its agent may be of little value unless the agent’s own policy has 
converged to closely match that used by the Action Oracle; (iii) there is some 
step change in complexity between better than memoryless optimal and optimal 
policies for Sutton’s Grid World. 

We briefly looked at varying the cost of perceptual actions. As the perceptual 
action cost decreases, the number of optimal policies learnt by the Action Oracle 
Agent increases. However, as the perceptual cost tends to zero, the individual 
agents tend towards the rather uninteresting policy of always asking the Action 
Oracle what to do. Also, for the range of perceptual action rewards tried, the 
State Oracle Agent failed to learn any satisficing policies. There is no constraint 
preventing the State Oracle Agent from continuously requesting information 
from the State Oracle, and with the perceptual action reward set to -0.5 and 
a discount factor (γ) of 0.9, it is less costly to talk to the oracle forever than to 
attempt to reach the goal from some of the more distant locations in the grid 
world. There is a limited cost range within which the State Oracle Agent should 
learn satisficing policies; however, this range is task specific and identifying it 
requires prior knowledge of the problem. This we would prefer to avoid. 
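
The claim about talking to the oracle forever can be checked with a short discounted-return calculation. The sketch below assumes the reward scheme described in section 3 (-1 per physical action, 0 for the action that enters the goal) and ignores any oracle requests made on the way to the goal; the function names are illustrative.

```python
def ask_forever_return(ask_reward, gamma):
    """Discounted return of consulting the oracle at every step, forever."""
    return ask_reward / (1.0 - gamma)


def reach_goal_return(n_steps, gamma):
    """Discounted return of reaching the goal in n physical actions:
    -1 for each of the first n - 1 actions and 0 for the final one."""
    return -sum(gamma ** k for k in range(n_steps - 1))


gamma = 0.9
forever = ask_forever_return(-0.5, gamma)  # approximately -5
# Smallest distance at which asking forever beats walking to the goal.
threshold = next(n for n in range(1, 100) if reach_goal_return(n, gamma) < forever)
print(threshold)  # 8
```

Under these assumptions, any location eight or more steps from the goal makes perpetual consultation the less costly choice, whereas with the original reward of -1 the return for asking forever is about -10, which no finite path to the goal falls below.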

In conclusion, the oracles were introduced as an idealised active perception 
system, and their success suggests that the use of active perception should pro- 
vide a feasible approach to solving partially observable navigation problems. 



7 Future Work 

Future work flowing from this paper includes: (i) examining what prevents the 
combination of oracles and eligibility traces from generating a greater number of 
optimal solutions; (ii) generating comparative results for agents that use memory 
or internal models; (iii) extending these results to other grid worlds and also 
non-minimum time problems, e.g. McCallum’s New York driving problem [10]; (iv) 
testing the performance of agents using actual active perception systems. 

Abstract. Comparative machine learning experiments have become an 
important methodology in empirical approaches to natural language 
processing (i) to investigate which machine learning algorithms have the 
‘right bias’ to solve specific natural language processing tasks, and (ii) to 
investigate which sources of information add to accuracy in a learning 
approach. Using automatic word sense disambiguation as an example task, 
we show that with the methodology currently used in comparative machine 
learning experiments, the results may often not be reliable because 
of the role of and interaction between feature selection and algorithm 
parameter optimization. We propose genetic algorithms as a practical 
approach to achieve both higher accuracy within a single approach, and 
more reliable comparisons. 

1 Introduction 

Supervised machine learning methods are investigated intensively in empirical 
computational linguistics because they potentially have a number of advan- 
tages compared to standard statistical approaches. For example, Inductive Logic 
Programming (ILP) systems allow easy integration of linguistic background 
knowledge in the learning system, induced rule systems are often more inter- 
pretable, memory-based learning methods incorporate smoothing of sparse data 
by similarity-based learning, etc. 

Frequently, research in machine learning (ML) of natural language takes the 
form of comparative ML experiments, either to investigate the role of different 
information sources in learning a task, or to investigate whether the bias of some 
learning algorithm fits the properties of natural language processing tasks better 
than alternative learning algorithms. 

For the former goal, results of experiments with and without a certain infor- 
mation source are compared, to measure whether it is responsible for a statis- 
tically significant increase or decrease in accuracy. An example is text catego- 
rization: we may be interested in investigating whether part-of-speech tagging 
(adding the contextually correct morphosyntactic classes to the words in a document) 
improves the accuracy of a Bayesian text classification system or not. This 
can be achieved by comparing the accuracy of the classifier with and without 
the information source. 

In the latter goal, investigating the applicability of an algorithm for a type 
of task, the bias of an algorithm refers to the representational constraints and 
specific search heuristics it uses. Some examples of bias are the fact that decision 
tree learners favor compact decision trees, and that ILP systems can represent 
hypotheses in terms of first order logic in contrast to most other learning methods 
which can only represent propositional hypotheses. In such experiments, two or 
more ML algorithms are compared for their accuracy on the same data. One 
example is a comparison between eager and lazy learning algorithms for language 
tasks: we may want to show that abstracting from infrequent examples, as done 
in eager learning, is harmful to generalization accuracy [5]. 

Apart from its inherent interest, the comparative machine learning approach 
has also gained importance because of the influence of competitive research 
evaluations such as SENSEVAL¹ and the CoNLL shared tasks², in which 
ML and other systems are compared on the same training and test data. SENSEVAL 
concerns research on word sense disambiguation, which we will use as a test case 
in this paper. 

Word Sense Disambiguation (WSD) is a natural language processing task 
in which a word with more than one sense has to be disambiguated by using 
information from the context in which the word occurs. For example, knight can 
(among others) refer to a chess piece or a medieval character. WSD is an essential 
subcomponent in applications such as machine translation (depending on the sense, 
knight will be translated into French as either cavalier or chevalier), language 
understanding, question answering, information retrieval, and so on. Over the 
last five years, two SENSEVAL competitions have been run to test the strengths 
and weaknesses of WSD systems with respect to different words, different aspects 
of language, and different languages in carefully controlled contexts [10,8]. Ma- 
chine learning methods such as decision list learning and memory-based learning 
have been shown to outperform hand-crafting approaches in these comparisons, 
leading to a large body of comparative work of the two types discussed earlier. 

A seminal paper by Mooney on the comparison of the accuracy of different 
machine learning methods [16] on the task of WSD is a good example of this 
classifier comparison approach. He tested seven ML algorithms on their ability 
to disambiguate the word line, and drew several conclusions in terms of algorithm 
bias to explain the results. Many more examples can be found in the 
recent NLP literature of similar studies and interpretations [17,9,15], often with 
contradictory results and interpretations. 

In the remainder of this paper, we will first describe standard methodology 
in Section 2 and show empirically that this methodology leads to conclusions 
that are not reliable for our WSD problem and for other machine learning tasks 
inside and outside computational linguistics (Section 3). In Section 4 we show 
that the joint optimization of feature selection and algorithm parameters using 
a genetic algorithm is computationally feasible, leads in general to good results, 
and could therefore be used both to achieve higher accuracy and more reliable 
comparisons. Section 5 discusses our results in the light of related research. 

¹ http://www.senseval.org. 

² http://www.aclweb.org/signll. 

2 Limitations of Standard Methodology 

Crucial for objectively comparing algorithm bias and relevance of information 
sources is a methodology to reliably measure differences and compute their statis- 
tical significance. A detailed methodology has been developed for this [21] involv- 
ing approaches like k-fold cross-validation [11,1,7] to estimate classifier quality 
(in terms of accuracy or derived measures like precision, recall, and F-score), as 
well as statistical techniques like McNemar's test [7] and paired cross-validation t-tests
for determining the statistical significance of differences between algorithms or 
between presence or absence of information sources. Although this methodol- 
ogy is not without its problems [18], it is generally accepted and used both in 
machine learning and in most work in statistical NLP. 
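As an illustration of the significance testing mentioned above, the following is a minimal sketch of an exact (binomial) McNemar test on the paired per-item outcomes of two classifiers. The function name and the choice of the exact variant are assumptions made for this example; the source does not say which variant of the test is used.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    correct_a[i] and correct_b[i] say whether classifier A and B
    labelled test item i correctly. Returns the p-value.
    """
    # b: items A got right and B got wrong; c: the reverse.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    k = min(b, c)
    # Under the null hypothesis the discordant items split 50/50,
    # so the p-value is a two-sided binomial tail probability.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the items on which the two classifiers disagree enter the test, which is why it is applied to paired predictions on the same test items.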

Many factors potentially play a role in the outcome of a (comparative) ma- 
chine learning experiment: the data used (the sample selection and the sample 
size), the information sources used (the features selected) and their representa- 
tion (e.g. as nominal or binary features), and the algorithm parameter settings 
(most ML algorithms have various parameters that can be tuned). 

In a typical comparative machine learning experiment, two or more algo- 
rithms are compared for a fixed sample selection, feature selection, feature rep- 
resentation, and (default) algorithm parameter setting over a number of trials 
(cross-validation), and if the measured differences are statistically significant, 
conclusions are drawn about which algorithm is better suited to the problem 
being studied and why (mostly in terms of algorithm bias)^. Sometimes differ- 
ent sample sizes are used to provide a learning curve, and sometimes parameters 
of (some of the) algorithms are optimized on training data, or heuristic feature 
selection is attempted, but this is exceptional rather than common practice in 
comparative experiments. Interactions between different factors, like the effect of 
interleaved feature selection and algorithm parameter optimization, have to the 
best of our knowledge not yet been investigated systematically in comparative 
machine learning experiments for language processing problems. 

In the remainder of this paper, we test the hypotheses that (i) feature selec- 
tion, algorithm parameter optimization, and their joint optimization cause larger 
differences in accuracy within a single algorithm than differences observed be- 
tween different algorithms using default parameter settings and feature input, 
and (ii) that the effect of adding and removing an information source when using 
default parameters can be reversed when re-optimizing the algorithm parame- 
ters. The implication of evidence for these hypotheses is that a large part of 
the comparative machine learning of language literature may not be reliable. 

^ A similar approach is taken for the comparison of information sources. 
Another implication is that joint optimization can lead to significantly higher
generalization accuracies, but this issue is not the focus of this paper. 

3 Feature Selection, Parameter Optimization, 
and Their Interaction 

In this Section, we analyze the impact of algorithm parameter optimization, fea- 
ture selection, and the interaction between both on classifier accuracy in com- 
parative experiments on WSD data and on the UCI benchmark datasets. 

Feature (subset) selection is the process in which those of the available
predictor features defining the input of the classification task are removed that
cannot be shown to be relevant in solving the learning task [11]. For computational
reasons we used a backward selection algorithm: we start with all available
features, look at the effect on accuracy of deleting one of the features, and
continue deleting until no further accuracy increase is observed. Algorithm
parameter optimization is the process in which the parameters of a learning system
(e.g. the learning rate for neural networks, or the number of nearest neighbors in
memory-based learning) are tuned for a particular problem. Although most machine
learning systems provide sensible default settings, it is by no means certain
that these will be the optimal parameter settings for a particular task. In both
cases (feature selection and parameter optimization), we are performing a model 
selection task which is well-known in machine learning. But as we mentioned ear- 
lier, whereas some published work in computational linguistics discusses either 
feature selection for some task, or algorithm parameter optimization for oth- 
ers, the effects of their interaction have, as far as we know, never been studied 
systematically. 
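The backward selection procedure described above can be sketched as follows. This is a minimal greedy version under stated assumptions: `evaluate` is a hypothetical callback that returns the (cross-validated) accuracy of a learner restricted to a given feature subset, and ties are broken by list order; neither detail is specified in the source.

```python
def backward_selection(features, evaluate):
    """Greedy backward elimination of features.

    features: list of feature names.
    evaluate: hypothetical callback, feature subset -> accuracy.
    Returns the selected subset and its accuracy.
    """
    current = list(features)
    best = evaluate(current)
    while len(current) > 1:
        # Measure the effect of deleting each remaining feature in turn.
        candidates = [
            (evaluate([f for f in current if f != g]), g) for g in current
        ]
        score, victim = max(candidates, key=lambda pair: pair[0])
        if score <= best:
            break  # no deletion increases accuracy any more
        best = score
        current.remove(victim)
    return current, best
```

Each pass costs one evaluation per remaining feature, which is why interleaving this loop with a parameter search quickly becomes expensive.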

The general set-up of our experiments is the following. Each experiment is 
done using 10-fold cross-validation on the available data. This means that
the data is split into 10 partitions, and each of these is used once as the test set,
with the other nine as the corresponding training set. For each dataset, we provide
information about the accuracy of two different machine learning systems under 
four conditions: 

1. Using their respective default settings. 

2. After optimizing the feature subset selection (backward selection) using 
default parameter settings, for each algorithm separately. (Optimization 
step 1). 

3. After optimizing the algorithm parameters for each algorithm individually. 
Each “reasonable” parameter setting is tested with a 10-fold cross-validation 
experiment. (Optimization step 2). 

4. After performing feature selection interleaved with optimization of the pa- 
rameters for each algorithm in turn. (Optimization step 3). 
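The cross-validation protocol underlying all four conditions can be sketched as below. This is an illustrative sketch only: the way items are assigned to partitions (striding) and the stand-in majority-class learner `fit_majority` are assumptions for the example, not the set-up used with timbl or ripper.

```python
from collections import Counter

def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k partitions."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(xs, ys, fit, k=10):
    """Average accuracy over k folds; fit(train_x, train_y) returns a predictor."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        predict = fit([xs[j] for j in train], [ys[j] for j in train])
        hits = sum(predict(xs[j]) == ys[j] for j in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

def fit_majority(train_x, train_y):
    """Stand-in learner: always predicts the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label
```

In conditions 2 to 4, a function such as `cross_validate` would be called once per candidate feature subset or parameter setting, and the best average accuracy would be reported.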

We expect from our first hypothesis that each optimization step can increase 
the accuracy of the best result (as measured by the average result over the 10 
experiments) considerably. In general, we expect the differences we record for the 
same algorithm over the four conditions to be much larger than the difference 
between the two learning algorithms when using default settings. As we are
primarily interested in showing the variability of results due to the different 
optimizations, all results reported will be cross-validation results, and not results 
on an additional held-out dataset (this would imply an additional cross-validation 
loop within the cross-validation loop, which is computationally infeasible). For 
the WSD data we will report results on test datasets used by SENSEVAL which 
give an indication of the usefulness of the approach for improving accuracy. 



3.1 Machine Learning Methods and Data 



We chose two machine learning techniques for our experiments: the memory- 
based learning package timbl [6]^, and the rule induction package ripper [3]. 
These two approaches provide extremes of the eagerness dimension in ML (the
degree to which a learning algorithm abstracts from the training data in forming
a hypothesis). timbl is an example of lazy learning, ripper of eager learning.

For TIMBL the following algorithm parameters were optimized: similarity met- 
rics, feature weighting metrics, class voting weighting, and number of nearest 
neighbors (varied between 1 and 45). For ripper the following parameter set- 
tings were varied: class ordering, negative tests in the rule conditions, hypothe- 
sis simplification magnitude, and example coverage. See [6,3] for explanation of 
these parameters. 

The “line” data set has become a benchmark dataset for work on word sense 
disambiguation. It was first produced and described by Leacock, Towell, and
Voorhees [14]. It consists of instances of the word line, taken from the 1987-89 
Wall Street Journal and a 25 million word corpus from the American Printing 
House for the Blind. 4,149 examples of occurrences of line were each tagged 
with one of six WordNet senses: text, formation, division, phone, cord, product. 
Because the “product” sense is 5 times more common than any of the other 
senses, a sampled dataset was also used in which all senses are represented 
equally. For each sense, 349 instances were selected at random, producing a total 
of 2094 examples with each sense having an equal share. This kind of sampling 
has been done before, and has also been reported in the literature [16]. However, 
we made our own sample as the other samples were not readily available. This 
means that our sampled data results cannot be compared directly to those of 
other systems. 
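The construction of the sampled dataset can be illustrated with a short sketch. It assumes instances are available as (features, sense) pairs and draws the same number of instances per sense without replacement; the function name, the seed, and the data layout are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(instances, per_class, seed=0):
    """Draw `per_class` instances at random for every sense.

    instances: list of (features, sense) pairs.
    Returns a list with len(senses) * per_class instances.
    """
    by_sense = defaultdict(list)
    for inst in instances:
        by_sense[inst[1]].append(inst)
    rng = random.Random(seed)
    sample = []
    for sense in sorted(by_sense):
        # rng.sample raises ValueError if a sense has too few instances.
        sample.extend(rng.sample(by_sense[sense], per_class))
    return sample
```

With six senses and 349 instances per sense this yields 6 x 349 = 2094 examples, matching the size reported above. Because the draw is random, two independently made samples will generally differ, which is why results on different samples are not directly comparable.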

Here is an example of an instance representing one occurrence of the word 
line in the corpus: 



line-n.w9_15:17036:, pen, NN, writing, VBG, those, DT, lines, line, NNS, was,
VBD, that, IN, of, IN, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0,0, 
0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 
0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 
0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 
0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 
0,0, 0,0, 0,0, 0,0, 0,0, 0,0, text. 



Available from http://ilk.kub.nl.
Table 1. Results of timbl and ripper on different WSD data sets when using (i) default
settings, (ii) backward selection, (iii) parameter optimization, and (iv) interleaved
backward selection and parameter optimization.

“line” (complete)                     TIMBL   RIPPER
WORDS        (default)                60.2    63.9
             (feat. sel.)             62.7    63.9
             (param. opt.)            63.4    70.2
             (interleaved opt.)       64.5    91.3
WORDS+POS    (default)                57.8    63.8
             (feat. sel.)             62.7    64.7
             (param. opt.)            64.3    71.6
             (interleaved opt.)       64.9    76.4

“line” (sampled)                      TIMBL   RIPPER
WORDS        (default)                59.1    40.4
             (feat. sel.)             60.3    40.9
             (param. opt.)            66.4    61.2
             (interleaved opt.)       66.7    63.3
WORDS+POS    (default)                56.9    41.4
             (feat. sel.)             61.5    41.6
             (param. opt.)            67.3    60.5
             (interleaved opt.)       68.1    61.1



The first entry (line-n.w9_15:17036:) is an ID tag for the instance, and is
ignored in the learning process. Next are the three context words occurring
before the focus word, together with their parts of speech. Then follows the
form of the word line, in this case the plural, together with its base form (line),
and its part of speech (NNS). Then we can see the right context of the focus
word, also with parts of speech. The next 200 features are binary features, each
indicating the presence or absence of one of the 200 most salient context words of line.

3.2 Results on the WSD Datasets 

In Table 1, if we focus on the variation for a single algorithm over the four con- 
ditions, we can see that parameter optimization, feature selection, and combined 
feature selection with parameter optimization lead to major accuracy improve- 
ments compared to the results obtained with default parameter settings. These 
‘vertical’ accuracy differences are much larger than the ‘horizontal’ algorithm- 
comparing accuracy differences. 

The fact that we could observe large standard deviations in the optimization
experiments also confirms the necessity of parameter optimization (only the
best result is represented in Table 1 for each optimization step).

With respect to the selected parameter settings and feature combinations, we 
found that parameter settings which are optimal when using all features are not 
necessarily optimal when performing feature selection. Furthermore, we could 
observe that the feature selection considered to be optimal for timbl was often 
different from the one optimal for ripper. 
Table 2. Results of timbl and ripper on different UCI data sets when using (i) default
settings, (ii) backward selection, (iii) parameter optimization, and (iv) interleaved
backward selection and parameter optimization.

Dataset                                                      TIMBL   RIPPER
database for fitting contact lenses   (default)              75.0    79.2
                                      (feat. sel.)           87.5    87.5
                                      (param. opt.)          87.5    87.5
                                      (interleaved opt.)     87.5    87.5
contraceptive method choice           (default)              48.5    46.8
                                      (feat. sel.)           52.2    48.2
                                      (param. opt.)          54.2    49.8
                                      (interleaved opt.)     54.8    49.8
breast-cancer-wisconsin               (default)              95.7    93.7
                                      (feat. sel.)           96.3    95.3
                                      (param. opt.)          97.4    95.7
                                      (interleaved opt.)     97.6    95.7
car evaluation database               (default)              94.0    87.0
                                      (feat. sel.)           94.0    87.0
                                      (param. opt.)          96.9    98.4
                                      (interleaved opt.)     96.9    98.4
postoperative patient data            (default)              55.6    71.1
                                      (feat. sel.)           71.1    71.1
                                      (param. opt.)          71.1    71.1
                                      (interleaved opt.)     71.1    71.1



We conclude that we have found evidence for our hypothesis (i) that the 
accuracy differences between different machine learning algorithms using stan- 
dard comparative methodology will in general be lower than the differences in 
accuracy resulting from interactions between algorithm parameter settings and 
information source selection, at least for this task (see [4] for similar results on 
other language datasets). 



3.3 Results on the UCI Benchmarks 

We investigated whether the effect is limited to natural language processing
datasets by applying the same optimization to 5 UCI benchmark datasets^:
“database for fitting contact lenses” (24 instances), “contraceptive method
choice” (1473 instances), “breast-cancer-wisconsin” (699 instances), “car evaluation
database” (1728 instances) and “postoperative patient data” (90 instances).
Compared to our language processing datasets, these datasets are small. From 
the results in Table 2, we nevertheless see the same effects: the default settings 
for the algorithms are not optimal; the difference in accuracy for a single al- 
gorithm in the four conditions generally overwhelms accuracy differences found 
between the algorithms, and in cases like the “car evaluation database”, we see 

http://www.ics.uci.edu/~mlearn/MLRepository.html.
that the initial result (timbl outperforms ripper) is reversed after optimization.
Similar effects explain why in the ML of natural language literature, so many 
results and interpretations about superiority of one algorithm over the other are 
contradictory. 
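The comparison of 'horizontal' and 'vertical' differences can be checked directly against the figures in Table 2. The sketch below transcribes those accuracies, in the order (default, feat. sel., param. opt., interleaved opt.), and computes for each dataset the difference between the algorithms at default settings, the largest spread within one algorithm, and whether the winner changes between the default and the fully optimized condition. The shortened dataset names and the helper functions are introduced for this example only.

```python
# Accuracies transcribed from Table 2:
# (default, feat. sel., param. opt., interleaved opt.)
TABLE2 = {
    "contact lenses": {"timbl": (75.0, 87.5, 87.5, 87.5),
                       "ripper": (79.2, 87.5, 87.5, 87.5)},
    "contraceptive": {"timbl": (48.5, 52.2, 54.2, 54.8),
                      "ripper": (46.8, 48.2, 49.8, 49.8)},
    "breast cancer": {"timbl": (95.7, 96.3, 97.4, 97.6),
                      "ripper": (93.7, 95.3, 95.7, 95.7)},
    "car evaluation": {"timbl": (94.0, 94.0, 96.9, 96.9),
                       "ripper": (87.0, 87.0, 98.4, 98.4)},
    "postoperative": {"timbl": (55.6, 71.1, 71.1, 71.1),
                      "ripper": (71.1, 71.1, 71.1, 71.1)},
}

def summarize(rows):
    """Map dataset -> (horizontal difference at default, largest vertical spread)."""
    out = {}
    for name, acc in rows.items():
        horizontal = abs(acc["timbl"][0] - acc["ripper"][0])
        vertical = max(max(v) - min(v) for v in acc.values())
        out[name] = (round(horizontal, 1), round(vertical, 1))
    return out

def winner(acc, condition):
    """Name of the better algorithm under one condition, or 'tie'."""
    t, r = acc["timbl"][condition], acc["ripper"][condition]
    return "tie" if t == r else ("timbl" if t > r else "ripper")

# Datasets where the strict winner changes from default to interleaved opt.
flipped = [name for name, acc in TABLE2.items()
           if {winner(acc, 0), winner(acc, 3)} == {"timbl", "ripper"}]
print(summarize(TABLE2))
print(flipped)  # ['car evaluation']
```

On these figures the vertical spread is at least as large as the horizontal difference for every dataset, and the car evaluation database is the one case where a strict win for one algorithm turns into a strict win for the other.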

4 Genetic Algorithms for Optimization 

Our results of the previous Section show that a proper comparative experiment
requires extensive optimization of a combinatorially explosive nature, and that
the accuracy increases obtainable by going to this trouble are considerable. Optimization
and model selection problems of the type described in this paper are of
course not unique to machine learning of language. Solutions like genetic algo- 
rithms (GAs) have been used for a long time as domain-independent techniques 
suitable for exploring optimization in large search spaces such as those described 
in this paper. We applied this optimization technique to our datasets. 

The evolutionary algorithm used to optimize the feature selection and parameter
settings employs an algorithmic scheme similar to that of evolution
strategies: a population of μ individuals forms the genetic material from which λ
new individuals are created using crossover and mutation. The μ best individuals
of this bigger temporary population are selected to become the next generation
of the algorithm.

An individual contains particular values for all algorithm parameters and 
for the selection of the features. E.g., for timbl, the large majority of these 
parameters control the use of a feature (ignore, weighted overlap, modified value 
difference), and are encoded in the chromosome as ternary alleles. At the end of 
the chromosome the 5-valued weighting parameter w and the 4-valued neighbor 
weighting parameter d are encoded, together with the k parameter which controls 
the number of neighbors. The latter is encoded as a real value which represents 
the logarithm of the number of neighbors. The quality or fitness of an individual 
is the classification result returned by timbl with these particular parameter 
settings. A similar approach is followed for encoding the ripper parameters 
into an individual. 
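The encoding of a timbl individual described above can be sketched as a small data structure. The allele labels, the dictionary layout, and the upper bound on k (taken here from the range used in Section 3.1) are assumptions for illustration; the source only specifies ternary feature alleles, a 5-valued w, a 4-valued d, and a real-valued logarithm of k.

```python
import math
import random

FEATURE_ALLELES = ("ignore", "weighted_overlap", "mvdm")  # ternary, one per feature
N_WEIGHTINGS = 5   # number of values of the weighting parameter w
N_VOTINGS = 4      # number of values of the neighbor weighting parameter d
MAX_K = 45         # hypothetical upper bound on the number of neighbors

def random_individual(n_features, rng):
    """Chromosome with uniformly sampled values for every position."""
    return {
        "features": [rng.randrange(3) for _ in range(n_features)],
        "w": rng.randrange(N_WEIGHTINGS),
        "d": rng.randrange(N_VOTINGS),
        "log_k": rng.uniform(0.0, math.log(MAX_K)),  # real-valued gene
    }

def decode(individual):
    """Translate a chromosome into concrete learner settings."""
    k = max(1, min(MAX_K, round(math.exp(individual["log_k"]))))
    return {
        "metrics": [FEATURE_ALLELES[a] for a in individual["features"]],
        "w": individual["w"],
        "d": individual["d"],
        "k": k,
    }
```

Encoding k on a logarithmic scale means that a fixed amount of mutation noise changes small values of k by a few neighbors and large values by many, which matches the intuition that the difference between 1 and 3 neighbors matters more than that between 41 and 43.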

The initial population is filled with individuals consisting of uniformly sam- 
pled values. The mutation operator replaces, independently for each position 
and with a small probability, the current value with an arbitrary other value. 
The mutation rates of the features are set independently of that of w and d. 
In the case of the k parameter, Gaussian noise is added to the current value. 
The crossover operators used are the traditional 1-point, 2-point and uniform 
crossovers. They operate on the whole chromosome. The selection strength can 
be controlled by tuning the proportion μ/λ; an alternative strategy chooses the
μ best individuals from the combination of μ parents and λ children. The GA
parameters were set using limited explorative experimentation. We are aware 
that the algorithm parameter optimization problem we try to solve with GAs 
also applies to the GAs themselves. 
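The overall evolutionary scheme can be sketched generically as follows. This is not the implementation used in the experiments: the population sizes, mutation rate, number of generations, and the toy bit-string problem are all assumptions chosen to keep the example self-contained, and in the real setting the fitness of an individual would be the accuracy returned by the learner.

```python
import random

def uniform_crossover(a, b, rng):
    """Take each position from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def flip_mutation(bits, rng, rate=0.1):
    """Independently flip each bit with a small probability."""
    return [1 - x if rng.random() < rate else x for x in bits]

def evolve(init, fitness, mutate, crossover,
           mu=10, lam=40, generations=20, plus=False, seed=0):
    """Evolution-strategy style loop with mu parents and lam children.

    plus=False: next generation = mu best of the lam children.
    plus=True:  next generation = mu best of parents and children together.
    """
    rng = random.Random(seed)
    parents = [init(rng) for _ in range(mu)]
    for _ in range(generations):
        children = []
        for _ in range(lam):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pool = children + parents if plus else children
        parents = sorted(pool, key=fitness, reverse=True)[:mu]
    return max(parents, key=fitness)

def random_bits(rng, n=12):
    return [rng.randrange(2) for _ in range(n)]

# Toy problem: maximize the number of ones in a 12-bit string.
best = evolve(random_bits, sum, flip_mutation, uniform_crossover, plus=True)
```

With `plus=True` the best individual can never get worse from one generation to the next, whereas the comma variant trades that guarantee for more exploration. When fitness evaluation means a full cross-validation run, caching fitness values per individual would be a natural addition.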
Table 3. Validation results for timbl on five word experts (WE), for datasets with and
without keyword information. For the smaller datasets, interleaved backward keyword
selection and parameter optimization results are included and are shown to be worse
than those of the GA. For the larger dataset, only the GA could be used to perform
interleaved optimization because the other method had become too computationally
expensive. SENSEVAL test set results are between brackets for default vs. GA with
keywords.

             Words+POS
WE           Def.    Opt.    GA
bar          48.1    55.3    66.3
channel      60.9    70.5    73.9
develop      19.3    29.6    29.6
natural      42.8    52.7    58.9
post         60.2    66.5    75.6

             Words+POS+Keywords
WE           Def.           GA
bar          44.8 (47.0)    66.9 (59.6)
channel      63.3 (50.7)    75.4 (53.4)
develop      17.0 (37.7)    29.6 (29.0)
natural      40.3 (31.1)    61.3 (43.7)
post         57.4 (51.9)    77.8 (58.2)



4.1 Results 

The WSD data sets discussed in Table 3 were selected from the SENSEVAL-2 data, 
which provided training and test material for different ambiguous words. Each 
word was given a separate training and test set. We chose five of these words 
randomly, taking into account the following restrictions: at least 150 training 
items should be available, and the word should have at least 5 senses, each 
sense being represented by at least 10 training items. This process came up 
with the words bar, channel, develop, natural, post. Instances were made for each 
occurrence of each word in the same way as for the line data. 

We see that the GA succeeds in finding solutions that are significantly better
than the default solutions and the best solutions obtained by a heuristic combined
feature selection and algorithm parameter optimization approach. The
main advantage of the GA is that it allows us to explore much larger search
spaces; for this problem, for example, it also covers the use of context keywords,
which would be computationally impossible with the heuristic methods in Section 3.

For the words+POS+keywords results, we added between brackets, for both the
default and the GA settings, the results on the SENSEVAL test sets. These show that
the optimization can indeed be used not only for showing the variability of the
results, but also for obtaining higher predictive accuracy (although this should
ideally be shown using two embedded cross-validation loops, which turned out
to be computationally infeasible for our data).
Table 4. Results of timbl with default settings and after interleaved feature selection
and parameter optimization with a GA on the different WSD data sets for different
information sources.

                    Default                                      GA
dataset             words   words+POS   words+POS+keywords      words   words+POS   words+POS+keywords
“bar”               50.0    48.1        44.8                    56.4    66.3        66.9
“channel”           62.3    60.9        63.3                    72.0    73.9        75.4
“develop”           16.3    19.3        17.0                    34.8    29.6        29.6
“natural”           41.6    42.8        40.3                    55.6    58.9        61.3
“post”              62.5    60.2        57.4                    71.0    75.6        77.8
“line” (sampled)    59.1    56.9        57.0                    66.9    66.9        66.9



4.2 Results on the Comparison of Information Sources 

In Table 4, we find evidence for our second hypothesis (the effect of adding 
an information source can switch between positive and negative depending on 
the optimization). E.g. where results with the default settings would lead to
a conclusion that keyword features don’t help for most WSD problems except
“channel”, the GA optimization shows that combinations of parameter settings
and feature selection can be found for all WSD problems except “develop” which
show exactly the opposite.
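The sign changes behind this observation can be read off Table 4 with a few lines of arithmetic. The sketch below transcribes the words+POS and words+POS+keywords columns and computes the effect of adding keywords under default settings and after GA optimization; the variable and function names are introduced for this example only.

```python
# (words+POS, words+POS+keywords) accuracies transcribed from Table 4.
DEFAULT = {"bar": (48.1, 44.8), "channel": (60.9, 63.3),
           "develop": (19.3, 17.0), "natural": (42.8, 40.3),
           "post": (60.2, 57.4), "line (sampled)": (56.9, 57.0)}
GA = {"bar": (66.3, 66.9), "channel": (73.9, 75.4),
      "develop": (29.6, 29.6), "natural": (58.9, 61.3),
      "post": (75.6, 77.8), "line (sampled)": (66.9, 66.9)}

def keyword_effect(table):
    """Accuracy change caused by adding the keyword features."""
    return {name: round(with_kw - without, 1)
            for name, (without, with_kw) in table.items()}

effect_default = keyword_effect(DEFAULT)
effect_ga = keyword_effect(GA)

# Datasets where keywords hurt with default settings but help after optimization.
reversed_sign = [n for n in DEFAULT if effect_default[n] < 0 < effect_ga[n]]
print(effect_default)
print(effect_ga)
print(reversed_sign)  # ['bar', 'natural', 'post']
```

For "bar", "natural" and "post" the keyword features lower accuracy under default settings and raise it after optimization, so a conclusion about this information source drawn from default settings alone would point in the opposite direction.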



5 Related Research and Conclusion 

Most comparative ML experiments, at least in computational linguistics, explore 
only one or a few points in the space of possible experiments for each algorithm 
to be compared. We have shown that regardless of the methodological accuracy 
with which the comparison is made, there is a high risk that other areas in the 
experimental space may lead to radically different results and conclusions. In 
general, the more effort is put in optimization (in this paper by exploring the 
interaction between feature selection and algorithm parameter optimization), the 
better the results will be, and the more reliable the comparison will be. Given the 
combinatorially explosive character of this type of optimization, we have chosen
GAs as a computationally feasible way to achieve this; no other heuristic
optimization techniques allow the complex interactions we want to optimize. As 
a test case we used WSD datasets. In previous work [4] we showed that the same 
effects also occur in other tasks, like part of speech tagging and morphological 
synthesis. 

The current paper builds on results obtained earlier on WSD [19,20] in which 
we found that independent optimization of algorithm parameters for each word 
to be disambiguated led to higher accuracy, which at one point we thought to 
be a limitation of the method used (memory-based learning). In this paper, we 
show that the problem is much more general than for a single algorithm (e.g. 
RIPPER behaves similarly). We also showed in this paper that feature selection 
and algorithm parameter optimization interact highly, and should be jointly
optimized. We also build on earlier, less successful attempts to use GAs for op- 
timization in memory-based learning [12,13]. GAs have been used for parameter 
optimization in ML a great deal, including for memory-based learning. A dif- 
ferent discussion point concerns the lessons we have to draw from the relativity 
of comparative machine learning results. In an influential recent paper, Banko 
and Brill [2] conclude that “We have no reason to believe that any comparative 
conclusions drawn on one million words will hold when we finally scale up to 
larger training corpora”. They base this point of view on experiments comparing
several machine learning algorithms on one typical NLP task (confusable
word disambiguation in context) with training data sizes varying from 1 million
to 1 billion words. We have shown in this paper that data sample size is only one
aspect influencing comparative results, and that accuracy differences due to al- 
gorithm parameter optimization, feature selection, and especially the interaction 
between both easily overwhelm the accuracy differences reported between algo- 
rithms (or information sources) in comparative experiments. Like the Banko and 
Brill study, this suggests that published results of comparative machine learning 
experiments (and their interpretation) may often be unreliable. 

The good news is that optimization of as many factors as possible (sample 
selection and size, feature selection and representation, algorithm parameters), 
when possible, will offer important accuracy increases and (more) reliable com- 
parative results. We believe that, in the long term, a GA approach offers a 
computationally feasible approach to this huge optimization problem. 
Abstract. Reinforcement learning aims to determine an (infinite time
horizon) optimal control policy from interaction with a system. It can
be solved by approximating the so-called Q-function from a sample of
four-tuples $(x_t, u_t, r_t, x_{t+1})$ where $x_t$ denotes the system state at time $t$,
$u_t$ the control action taken, $r_t$ the instantaneous reward obtained and
$x_{t+1}$ the successor state of the system, and by determining the optimal
control from the Q-function. Classical reinforcement learning algorithms
use an ad hoc version of stochastic approximation which iterates over the 
Q-function approximations on a four-tuple by four-tuple basis. In this pa- 
per, we reformulate this problem as a sequence of batch mode supervised 
learning problems which in the limit converges to (an approximation of) 
the Q-function. Each step of this algorithm uses the full sample of four- 
tuples gathered from interaction with the system and extends by one step 
the horizon of the optimality criterion. An advantage of this approach is 
to allow the use of standard batch mode supervised learning algorithms, 
instead of the incremental versions used up to now. In addition to a the- 
oretical justification the paper provides empirical tests in the context of 
the “Car on the Hill” control problem based on the use of ensembles of 
regression trees. The resulting algorithm is in principle able to handle 
efficiently large scale reinforcement learning problems. 



1 Introduction 

Many interesting problems in many fields can be formulated as closed-loop control
problems, i.e. problems whose solution is provided by a mapping (or a control
policy) $u_t = \mu(x_t)$ where $x_t$ denotes the state at time $t$ of a system and $u_t$ an
action taken by a controlling agent so as to influence the instantaneous and future
behavior of the system. In many cases these problems can be formulated
as infinite horizon discounted reward discrete-time optimal control problems, i.e.
problems where the objective is to find a (stationary) control policy $\mu^*(\cdot)$ which
maximizes the expected return over an infinite time horizon defined as follows:
$J_\infty^{\mu} = \lim_{N \to \infty} E\left\{ \sum_{t=0}^{N-1} \gamma^t r_t \right\}$    (1)

where $\gamma \in [0, 1[$ is the discount factor, $r_t$ is an instantaneous reward signal which
depends only on the state $x_t$ and action $u_t$ at time $t$, and where the expectation
is taken over all possible system trajectories induced by the control policy $\mu(\cdot)$.

Optimal control theory, and in particular dynamic programming, aims to 
solve this problem “exactly” when the explicit knowledge of system dynamics 
and reward function are given a priori. In this paper we focus on reinforcement 
learning (RL), i.e. the use of automatic learning algorithms in order to solve the 
optimal control problem “approximately” when the sole information available is 
the one we obtain from system transitions from t to t+1. Each system transition 
provides the knowledge of a new four-tuple $(x_t, u_t, r_t, x_{t+1})$ of information and
we aim here to compute $\mu^*(\cdot)$ from a sample $T$ of such four-tuples
$(x_t^k, u_t^k, r_t^k, x_{t+1}^k)$, indexed by $k$.

It is important to contrast the RL protocol with the standard batch mode
supervised learning protocol, which aims at determining, from the sole information
of a sample $S$ of input-output pairs $(i, o)$, a function $h^* \in \mathcal{H}$ ($\mathcal{H}$ is called
the hypothesis space of the learning algorithm) which minimizes the expected
approximation error, e.g. defined in the case of least squares regression by the
following functional:

$Err_h = \sum_{(i,o) \in S} |h(i) - o|^2$    (2)

Notice that the use of supervised learning in the context of optimal control problems
would be straightforward if, instead of the sample T of four-tuples, we could
provide the learning algorithm with a sample of input-output pairs $(x, \mu^*(x))$ (see
for example [9] for a discussion on the combination of such a scheme with reinforcement
learning). Unfortunately, in many interesting control problems this
type of information cannot be acquired directly, and the specific difficulty in
reinforcement learning is to infer a good approximation of the optimal control 
policy only from the information given in the sample T of four-tuples. Existing 
reinforcement learning algorithms can be classified into two categories: 

— Model based RL methods: they use (batch mode or incremental mode) su- 
pervised learning to determine from the sample T of four-tuples on the one 
hand an approximation of the system dynamics: 



$f_1(x, u, x') \approx P(x_{t+1} = x' \mid x_t = x, u_t = u)$    (3)

and on the other hand an approximation of the expected reward function:

$f_2(x, u) \approx E\{r_t \mid x_t = x, u_t = u\}$.    (4)



Once these two functions have been obtained, model based algorithms derive 
the optimal control policy by dynamic programming [5,8]. 
— Non-model based RL methods: they use incremental mode supervised learning
in order to determine an approximation of the Q-function associated to
the control problem. This function is (implicitly) defined by the following 
equation (known as the Bellman equation): 



$Q(x, u) = E\left\{ r_t + \gamma \max_{u'} Q(x_{t+1}, u') \mid x_t = x, u_t = u \right\}$    (5)

The optimal control policy can be directly determined from this (unique)
Q-function by the following relation

$\mu^*(x) = \arg\max_u Q(x, u)$.    (6)

The most well-known algorithm falling into the latter category is the so- 
called Q-learning method [11]. 
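To make the contrast with batch mode learning concrete, the following is a minimal sketch of the standard tabular Q-learning update, which processes one four-tuple at a time. The learning rate `alpha`, the dictionary representation of Q, and the state and action names are assumptions for this example and are not taken from the source.

```python
from collections import defaultdict

def q_learning_update(q, four_tuple, actions, gamma=0.9, alpha=0.5):
    """Incremental update of Q from a single four-tuple (x, u, r, x_next)."""
    x, u, r, x_next = four_tuple
    # Sampled version of the right-hand side of the Bellman equation (5).
    target = r + gamma * max(q[(x_next, a)] for a in actions)
    q[(x, u)] += alpha * (target - q[(x, u)])
    return q

def greedy_action(q, x, actions):
    """Control derived from Q as in equation (6)."""
    return max(actions, key=lambda a: q[(x, a)])

q = defaultdict(float)  # Q initialized to zero everywhere
q_learning_update(q, ("s0", "right", 1.0, "s1"), ["left", "right"])
```

Each call touches a single entry of Q, so information from one transition spreads to other states only through many repeated updates; this four-tuple by four-tuple processing is what the batch mode formulation proposed here avoids.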

Our proposal is based on the observation that neither of these two approaches is
able to fully exploit the power of modern supervised learning methods. Indeed, 
model based approaches are essentially linked to so-called state space discretiza- 
tion which aims at building a finite Markov Decision Problem (MDP) and are 
strongly limited by the curse of dimensionality: in order to use the dynamic pro- 
gramming algorithms, the state and control spaces need to be discretized and 
the number of cells of any discretization scheme increases exponentially with 
the number of dimensions of the state space. Non-model based approaches have,
to the best of our knowledge, been combined only with incremental (on-line) learning
algorithms (see e.g. [10]). 

With respect to these approaches, we propose a novel non-model based RL 
framework which is able to exploit any generic batch mode supervised learning 
algorithm to model the Q-function. The resulting algorithm is illustrated on a 
simple problem where it is combined with three supervised learning algorithms 
based on regression trees. The rest of the paper is organized as follows: Section 
2 introduces the underlying idea of our approach and gives a precise description 
of the proposed algorithm; Section 3 provides a validation in the context of the 
“Car on the Hill” control problem; Section 4 provides discussions, directions for 
future research and conclusions. 



2 Iteratively Extending Time Horizon in Optimal Control 

The approach that we present is based on the fact that the optimal (stationary) 
control policy of an infinite horizon problem can be formalized as the limit of a 
sequence of finite horizon control problems, which can be solved in an iterative 
fashion by using any standard supervised learning algorithm. 

2.1 Iteratively Extending Time Horizon in Dynamic Programming

We consider a discrete-time stationary stochastic system defined by its dynamics,
i.e. a transition function defined over the Cartesian product $X \times U \times W$ of the
state space $X$, the control space $U$, and the disturbance space $W$:

$$x_{t+1} = f(x_t, u_t, w_t), \tag{7}$$

a reward signal also defined over $X \times U \times W$:

$$r_t = r(x_t, u_t, w_t), \tag{8}$$

a noise process defined by a conditional probability distribution:

$$w_t \sim P_w(w = w_t \mid x = x_t, u = u_t), \tag{9}$$

and a probability distribution over the initial conditions:

$$x_0 \sim P_x(x = x_0). \tag{10}$$

For a given (finite) horizon $N$, let us denote by

$$\pi_N(t, x) \in U, \quad t \in \{0, \ldots, N-1\}, \; x \in X \tag{11}$$

a (possibly time varying) $N$-step control policy (i.e. $u_t = \pi_N(t, x_t)$), and by

$$J_N^{\pi_N} = E\Big\{\sum_{t=0}^{N-1} \gamma^t r_t\Big\} \tag{12}$$

the $N$-step reward of the closed-loop system using this policy. An $N$-step optimal
policy is a policy which among all possible such policies maximizes $J_N^{\pi_N}$ for any
$P_x$ on the initial conditions. Notice that (under mild conditions) such a policy
always exists, although it is not necessarily unique.

Our algorithm exploits the following properties of $N$-step optimal policies
(these are classical results of dynamic programming theory [1]):

1. The sequence of policies obtained by considering the sequence of $Q_i$-functions
iteratively defined by

$$Q_1(x, u) = E\{r_t \mid x_t = x, u_t = u\} \tag{13}$$

and

$$Q_N(x, u) = E\Big\{r_t + \gamma \max_{u'} Q_{N-1}(x_{t+1}, u') \,\Big|\, x_t = x, u_t = u\Big\}, \quad \forall N > 1, \tag{14}$$

and the following two conditions¹

$$\pi_N^*(0, x) = \arg\max_{u} Q_N(x, u), \quad \forall N > 0 \tag{15}$$

and

$$\pi_N^*(t+1, x) = \pi_{N-1}^*(t, x), \quad \forall N > 1, \; t \in \{0, \ldots, N-2\} \tag{16}$$

is optimal.

2. The sequence of stationary policies defined by $\mu_N^*(x) = \pi_N^*(0, x)$ converges
(globally, and for any $P_x$ on the initial conditions) to $\mu^*(x)$ in the sense that

$$\lim_{N \to \infty} J_\infty^{\mu_N^*} = J_\infty^{\mu^*}. \tag{17}$$

3. The sequence of functions $Q_N$ converges to the (unique) solution of the
Bellman equation (eqn. (5)).

¹ Actually this definition does not necessarily yield a unique policy, but any policy
which satisfies this condition is appropriate, and it is straightforward to define a
procedure constructing such a policy from the sequence of $Q_i$-functions.
2.2 Iteratively Extending Time Horizon in Reinforcement Learning

The proposed algorithm is based on the use of supervised learning in order to
produce a sequence $\hat{Q}_i$ of approximations of the $Q_i$-functions defined above, by
exploiting at each step the full sample of four-tuples
$\mathcal{F} = \{(x_t^k, u_t^k, r_t^k, x_{t+1}^k),\ k = 1, \ldots, \ell\}$ in batch mode, together with the
function produced at the preceding step.

Initialization. The algorithm starts by using the sample $\mathcal{F}$ of four-tuples in
order to construct an approximation of $Q_1(x, u)$. This can be achieved using the
$(x_t, u_t)$ components of each four-tuple as inputs, and the $r_t$ component as output
and by using a supervised regression algorithm in order to find in its hypothesis
space $\mathcal{H}$ a function satisfying

$$\hat{Q}_1 = \arg\min_{h \in \mathcal{H}} \sum_{k=1}^{\ell} \big|h(x_t^k, u_t^k) - r_t^k\big|^2. \tag{18}$$

Iteration. Step $i$ ($i > 1$) of the algorithm uses the function produced at step
$i - 1$ to modify the output of each input-output pair associated to each four-tuple

$$o_i^k = r_t^k + \gamma \max_{u'} \hat{Q}_{i-1}(x_{t+1}^k, u') \tag{19}$$

and then applies the supervised learning algorithm to build

$$\hat{Q}_i = \arg\min_{h \in \mathcal{H}} \sum_{k=1}^{\ell} \big|h(x_t^k, u_t^k) - o_i^k\big|^2. \tag{20}$$

Stopping Conditions. For the theoretical sequence of policies, an error bound
on the sub-optimality in terms of the number of iterations is given by the
following equation

$$\ldots \; \frac{\ldots}{1 - \gamma} \tag{21}$$

where $B_r > \sup r(x, u, w)$. This equation can be used to fix an upper bound on
the number of iterations for a given a priori fixed optimality gap.

Another possibility is to exploit the convergence property of the sequence of
$Q_i$-functions in order to decide when to stop the iteration, e.g. when

$$\big|\hat{Q}_N - \hat{Q}_{N-1}\big| < \epsilon. \tag{22}$$

Control Policy Derivation. The final control policy seen as an approximation 
of the optimal stationary closed-loop policy is in principle derived by 

$$\hat{\mu}^*(x) = \hat{\mu}_N^*(x) = \arg\max_{u} \hat{Q}_N(x, u). \tag{23}$$

If the control space is finite, this can be done using exhaustive search. Other- 
wise, the algorithm to achieve this will depend on the type of approximation 
architecture used. 
Consistency. It is interesting to question under which conditions this algorithm
provides consistency, i.e. under which conditions the sequence of policies
generated by our algorithm using a sample of increasing size would converge
to the optimal control policy within a pre-specified optimality gap. Without
any assumption on the supervised learning algorithm used and on the sampling
mechanism, nothing can be said about consistency. On the other hand, if each
one of the true $Q_i$-functions can be arbitrarily well approximated by a function
of the hypothesis space and if the sample (in asymptotic regime) contains each
possible state-action pair $(x, u)$ an infinite number of times, then consistency
is ensured trivially. Further research is necessary in order to determine less ideal
assumptions both on the hypothesis space and on the sampling mechanism which
would still guarantee consistency.



Solution Characterization. Another way to state the reinforcement learning
problem would consist of defining the approximate Q-function as the solution of
the following equation

$$\hat{Q} = \arg\min_{h \in \mathcal{H}} \sum_{k=1}^{\ell} \Big|h(x_t^k, u_t^k) - \big(r_t^k + \gamma \max_{u'} h(x_{t+1}^k, u')\big)\Big|^2. \tag{24}$$

Our algorithm can be viewed as an iterative algorithm to solve this minimization
problem starting with an initial guess $\hat{Q}_0(x, u) = 0$ and at each iteration $i > 0$
updating the function according to

$$\hat{Q}_i = \arg\min_{h \in \mathcal{H}} \sum_{k=1}^{\ell} \Big|h(x_t^k, u_t^k) - \big(r_t^k + \gamma \max_{u'} \hat{Q}_{i-1}(x_{t+1}^k, u')\big)\Big|^2. \tag{25}$$
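The iteration of eqn. (25) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the regressor `fit_cell_means` (one constant per distinct input) is a hypothetical stand-in for the tree-based learners used in the paper, and the four-state chain problem, the function names and the parameter values are invented for the example.

```python
from collections import defaultdict


def fit_cell_means(inputs, outputs):
    """Toy batch regressor (assumption): least-squares fit with one constant per distinct input."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, target in zip(inputs, outputs):
        sums[key] += target
        counts[key] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    # Inputs never seen in the sample (e.g. a terminal state) are predicted as 0.
    return lambda x, u: means.get((x, u), 0.0)


def iterate_time_horizon(four_tuples, controls, gamma, n_iterations, fit=fit_cell_means):
    """Return an approximation of Q_N built from a fixed batch of four-tuples."""
    inputs = [(x, u) for (x, u, _, _) in four_tuples]
    q_prev = lambda x, u: 0.0  # initial guess Q_0 = 0
    for _ in range(n_iterations):
        # Regression targets of eqn. (19), computed from the previous iterate.
        outputs = [r + gamma * max(q_prev(x_next, v) for v in controls)
                   for (_, _, r, x_next) in four_tuples]
        q_prev = fit(inputs, outputs)  # batch regression of eqn. (20)
    return q_prev


def greedy_policy(q, controls):
    """Policy of eqn. (23), by exhaustive search over a finite control space."""
    return lambda x: max(controls, key=lambda u: q(x, u))


# Hypothetical toy problem: states 0..3 on a line, state 3 is terminal,
# and a reward of 1 is received on arrival at state 3.
CONTROLS = (-1, 1)
SAMPLE = []
for x in range(3):
    for u in CONTROLS:
        x_next = min(max(x + u, 0), 3)
        SAMPLE.append((x, u, 1.0 if x_next == 3 else 0.0, x_next))

q_hat = iterate_time_horizon(SAMPLE, CONTROLS, gamma=0.5, n_iterations=10)
policy = greedy_policy(q_hat, CONTROLS)
```

Any other batch regressor with the same `fit(inputs, outputs)` interface could be passed in place of the toy one; the sample itself is never modified between iterations, only the regression targets are.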



2.3 Supervised Regression Algorithm 

In principle, the proposed framework can be combined with any available super- 
vised learning method designed for regression problems. In order to be practical, 
the desirable features of the used algorithm are as follows: 

— Computational efficiency and scalability of the learning algorithm, especially
with respect to sample size and dimensionality of the state space X and the
control space U.

— Modeling flexibility. The $Q_i$-functions to be modeled by the algorithm are
unpredictable in shape; hence no prior assumption can be made on the
parametric shape of the approximation architecture, and the automatic learning
algorithm should be able to adapt its model by itself to the problem data.

— Reduced variance, in order to work efficiently in small sample regimes. 

— Fully automatic operation. The algorithm may be called several hundred 
times and it is therefore not possible to ask for a human operator to tune 
some meta-parameters at each step of the iterative procedure. 

— Efficient use of the model, in order to derive the control from the Q-function. 
In the simulation results given in the next section, we have compared three
learning algorithms based on regression trees which we think offer a good com- 
promise in terms of the criteria established above. We give a very brief description 
of each variant below. 

Regression Trees. Classification and regression trees are among the most pop- 
ular supervised learning algorithms. They combine several characteristics such 
as interpretability of the models, efficiency, flexibility, and fully automatic op- 
eration which make them particularly attractive for this application. To build 
such trees, we have implemented the CART algorithm as described in [4]. 

Tree Bagging. One drawback of regression trees is that they suffer from a high 
variance. Bagging [2] is an ensemble method proposed by Breiman that often 
improves very dramatically the accuracy of trees by reducing their variance. 
With bagging, several regression trees are built, each from a different bootstrap 
sample drawn from the original learning sample. To make a prediction with this 
set of M trees, we simply take the average predictions of these M trees. Note 
that, while bagging inherits several advantages of regression trees, it increases 
their computing times significantly. 
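The bootstrap-and-average principle can be sketched as follows. This is a minimal illustration, not the CART-based implementation used in the paper: the base learner `fit_stump` (a one-split regression tree on one-dimensional inputs) and the toy step-shaped data are assumptions chosen to keep the example short.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on 1-D inputs by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagging(xs, ys, n_trees, rng, fit=fit_stump):
    """Build n_trees models on bootstrap samples and average their predictions."""
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


# Hypothetical toy data: a step from 0 to 1 at x = 0.5.
XS = [i / 10 for i in range(10)]
YS = [0.0 if x < 0.5 else 1.0 for x in XS]
predict = bagging(XS, YS, n_trees=25, rng=random.Random(0))
```

The cost noted in the text is visible here: building the ensemble takes `n_trees` fits instead of one, and each prediction evaluates every tree.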

Extremely Randomized Trees (Extra-trees). Besides bagging, several 
other methods to build tree ensembles have been proposed that often improve 
the accuracy with respect to tree bagging (e.g. random forests [3]). In this paper, 
we propose to evaluate our own recent proposal which is called “Extra-trees”. 
Like bagging, this algorithm works by taking the average predictions of several 
trees. Each of these trees is built from the original learning sample by
selecting its tests fully at random. The main advantages of this algorithm with
respect to bagging are that it is computationally much faster (because of the
extreme randomization) and also often more accurate. For more details about this
algorithm, we refer the interested reader to [6,7].

3 Illustration: “Car on the Hill” Control Problem 

The precise definition of the test problem is given in the appendix. It is a version 
of a quite classical test problem used in the reinforcement learning literature. 

A car is traveling on a hill (the shape of which is given by the function H (p) of 
Figure 3b). The objective is to bring the car in minimal time to the top of the hill 
(p = 1 in Figure 3b). The problem is studied in discrete-time, which means here 
that the control variable can be changed only every 0.1 s. The control variable 
acts directly on the acceleration of the car (eqn. (27), appendix) but can only 
assume two extreme values (full acceleration or full deceleration). The reward 
signal is defined in such a way that the infinite horizon optimal control policy is 
a minimum time control strategy (eqn. (29), appendix). 

Our test protocol uses an “off-line” learning strategy. First, samples of
four-tuples are generated from fixed initial conditions and random walk in the control
space. Then these samples are used to infer control strategies according to the
proposed method. Finally these control strategies are assessed.

3.1 Four-Tuples Generation

To collect the samples of four-tuples, we observed a number of episodes of the
system. All episodes start from the same initial state corresponding to the car
stopped at the bottom of the valley (i.e. $(p, s) = (-0.5, 0)$) and stop when the
car leaves the region of the state space depicted in Figure 3a. In each episode, the
action $u_t$ at each time step is chosen at random with equal probability among its
two possible values $u = -4$ and $u = 4$. We will consider hereafter three different
samples of four-tuples denoted by $\mathcal{F}_1$, $\mathcal{F}_2$ and $\mathcal{F}_3$, containing respectively the
four-tuples obtained after 1000, 300, and 100 episodes. These samples are such
that $\#\mathcal{F}_1 = 58089$, $\#\mathcal{F}_2 = 18010$, and $\#\mathcal{F}_3 = 5930$. Note also that the reward
$r(x_t, u_t, w_t) = 1$ (corresponding to the goal state at the top of the hill) has been
observed only 1 time after 100 episodes, 5 times after 300 episodes, and 18
times after 1000 episodes.
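The sampling protocol described above (fixed initial state, uniformly random controls, episode ends when the system leaves the region of interest) can be sketched generically. The `step` interface, the `max_steps` safety cap and the integer toy system below are assumptions made for illustration; they are not the car dynamics of the appendix.

```python
import random


def generate_four_tuples(step, initial_state, controls, n_episodes, rng, max_steps=1000):
    """Collect (x_t, u_t, r_t, x_{t+1}) along episodes driven by uniformly random controls."""
    sample = []
    for _ in range(n_episodes):
        x = initial_state
        for _step_index in range(max_steps):  # safety cap, not part of the paper's protocol
            u = rng.choice(controls)
            x_next, r, terminal = step(x, u)
            sample.append((x, u, r, x_next))
            if terminal:
                break
            x = x_next
    return sample


def toy_step(x, u):
    """Hypothetical stand-in system: integer position, episode ends outside [-3, 3]."""
    x_next = x + u
    terminal = abs(x_next) > 3
    reward = 1.0 if x_next > 3 else 0.0
    return x_next, reward, terminal


four_tuples = generate_four_tuples(toy_step, 0, (-1, 1), n_episodes=100,
                                   rng=random.Random(0))
```

Because every episode restarts from the same state and the controls are random, rewarding transitions can be rare in the resulting sample, which is the effect reported above for the goal reward.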

3.2 Experiments 

To illustrate the behavior of the algorithm, we first use the sample $\mathcal{F}_1$ with
Extra-trees². As the action space is binary, we choose to separately model the functions
$\hat{Q}_N(x, -4)$ and $\hat{Q}_N(x, 4)$ by two ensembles of 50 Extra-trees. The policy $\hat{\mu}_1^*$
obtained is represented in Figure 1a. Black bullets represent states for which
$\hat{Q}_1(x, -4) > \hat{Q}_1(x, 4)$, white bullets states for which $\hat{Q}_1(x, -4) < \hat{Q}_1(x, 4)$, and
grey bullets states for which $\hat{Q}_1(x, -4) = \hat{Q}_1(x, 4)$. Successive policies $\hat{\mu}_N^*$ for
increasing $N$ are given on Figures 1b-1f. After 50 iterations, $\hat{\mu}_N^*$ has almost
stabilized.

To associate a score to each policy we define a set $X'$: $X' = \{(p, s) \in
X \mid \exists i, j \in \mathbb{Z} \mid (s, p) = (0.125\,i,\ 0.375\,j)\}$ and estimate the value of $J_\infty^{\hat{\mu}_N^*}$ when
$P_x(x_0) = 1/\#X'$ if $x_0 \in X'$ and 0 otherwise. The evolution of the score for
increasing $N$ is represented in Figure 2a for the three learning algorithms. With
Bagging and Extra-trees, we average 50 trees. After about 20 iterations, the score
does not improve anymore. Comparing the three supervised learning algorithms,
it is clear that bagging and Extra-trees are superior to single regression trees.
Bagging and Extra-trees are very close to each other but the score of Extra-trees
grows faster and is also slightly more stable.

On Figure 2b, we compare score curves corresponding to the three different 
sample sizes (with Extra-trees). As expected, we observe that a decrease of the 
number of four-tuples decreases the score. 

² The results with regression trees and tree bagging are discussed afterwards.

Fig. 1. Representation of $\hat{\mu}_N^*$ for different values of $N$. Sample

To give a better idea of the quality of the control strategy induced by our
algorithm, it would be interesting to compare it with the optimal one. Although it
is difficult to determine analytically the optimal control policy for this problem,
it is however possible to determine $J_\infty^{\mu^*}$ when the probability distribution on
the initial states is such that $P_x(x_0 = x) = 1$ if $x$ corresponds to the state
$(p, s) = (-0.5, 0)$ and 0 otherwise. This is achieved by exhaustive search, trying
out all possible control sequences of length $k$ when the system initial state is
$(p, s) = (-0.5, 0)$ and determining the smallest value of $k$ for which there is a
control sequence that leads the car to the top of the hill. From this procedure,
we find a minimum value of $k = 19$ and then $J_\infty^{\mu^*} = 0.397214$ ($= \gamma^{k-1}$). If
we use from the same initial state the policy learned by our algorithm (with
Extra-trees), we get:




Iteratively Extending Time Horizon Reinforcement Learning 105 



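- from $\mathcal{F}_1$, $J_\infty^{\hat{\mu}^*} = 0.397214 = \gamma^{18}$

- from $\mathcal{F}_2$, $J_\infty^{\hat{\mu}^*} = 0.397214 = \gamma^{18}$

- from $\mathcal{F}_3$, $J_\infty^{\hat{\mu}^*} = 0.358486 = \gamma^{20}$

For the two largest samples, $J_\infty^{\hat{\mu}^*}$ is equal to the optimum value while it is
only slightly inferior in the case of the smallest one.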
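The reported returns are powers of the decay factor $\gamma = 0.95$ given in the appendix: if the goal reward of 1 is the only non-zero reward and is received on the $k$-th transition, the discounted return is $\gamma^{k-1}$. The short check below (function name chosen for illustration) reproduces the two values quoted above, for a 19-step and a 21-step trajectory.

```python
GAMMA = 0.95  # decay factor stated in the appendix


def minimum_time_return(k, gamma=GAMMA):
    """Discounted return when the only non-zero reward (+1) is received at step k - 1."""
    return gamma ** (k - 1)


optimal = minimum_time_return(19)  # goal reached in 19 steps: gamma ** 18
slower = minimum_time_return(21)   # goal reached in 21 steps: gamma ** 20
```

Read this way, the policy learned from the smallest sample needs two more time steps than the optimal one to reach the top of the hill.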

3.3 Comparison with a Non-model Based Incremental Algorithm 

It is interesting to question whether our proposed algorithm is more efficient in
terms of learning speed than non-model based iterative reinforcement learning
algorithms. In an attempt to give an answer to this question we have considered
the standard Q-learning algorithm with a regular grid as approximation
architecture³.

We have used this algorithm during 1000 episodes (the same as the ones used
to generate $\mathcal{F}_1$) and then we have extracted from the resulting approximate
Q-function the policy $\hat{\mu}^*$ and computed $J_\infty^{\hat{\mu}^*}$ when considering the same probability
distribution on the initial states as the one used to compute the values of
$J_\infty^{\hat{\mu}_N^*}$ represented on Figures 2a-b. The highest value of $J_\infty^{\hat{\mu}^*}$ so obtained by repeating
the process for different grid sizes (a 10 x 10, a 11 x 11, ... and a 100 x 100 grid)
is 0.039 (which occurs for a 13 x 13 grid). This value is quite small compared
to $J_\infty^{\hat{\mu}^*} = 0.295$ obtained when using $\mathcal{F}_1$ as sample and the Extra-trees as
regression method (Figure 2a). Even when using ten times more (i.e. 10,000)
episodes with the Q-learning algorithm, the highest value of $J_\infty^{\hat{\mu}^*}$ obtained over
the different grids is still inferior (it is equal to 0.232 and occurs for a 24 x 24
grid).
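For reference, the incremental baseline can be sketched as a table-based Q-learning update on a regular grid. The sketch assumes the usual Q-learning rule with degree of correction 0.1 and a table initialized to zero, as stated in the footnote; the dictionary representation and the way `grid_cell` maps a state $(p, s)$ to a cell are illustrative assumptions.

```python
def q_learning_update(q_table, x, u, r, x_next, controls, alpha=0.1, gamma=0.95):
    """Apply one incremental Q-learning update to a dictionary-based table."""
    best_next = max(q_table.get((x_next, v), 0.0) for v in controls)
    old = q_table.get((x, u), 0.0)  # unseen entries count as zero
    q_table[(x, u)] = old + alpha * (r + gamma * best_next - old)
    return q_table[(x, u)]


def grid_cell(p, s, n):
    """Map (p, s) in [-1, 1] x [-3, 3] to a cell of a regular n x n grid."""
    i = min(int((p + 1.0) / 2.0 * n), n - 1)
    j = min(int((s + 3.0) / 6.0 * n), n - 1)
    return i, j


# One update on a 13 x 13 grid, for a hypothetical rewarding transition.
table = {}
value = q_learning_update(table, grid_cell(-0.5, 0.0, 13), 4, 1.0,
                          grid_cell(-0.4, 0.3, 13), (-4, 4))
```

Unlike the batch algorithm, each four-tuple here changes a single table entry by a small step, which is one way to see why many more episodes are needed before rewards propagate through the grid.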

4 Discussion and Conclusions 

We have presented a novel way of using batch mode supervised learning algo- 
rithms efficiently in the context of non-model based reinforcement learning. The 
resulting algorithm is fully autonomous and has been applied to an illustrative 
problem where it worked very well. 

Probably the most important feature of this algorithm is that it can scale 
very easily to high dimensional problems (e.g. problems with a large number 
of input variables and continuous control spaces) by taking advantage of recent 
advances of supervised learning techniques in this direction. This feature can 
for example be exploited to handle more easily partially observable problems, 
where it is necessary to use as inputs a history of observations rather than just 
the current state. It could also be exploited to carry out reinforcement learning 
based on perceptual input information (tactile sensors, images, sounds) without 
requiring complex pre-processing. 

³ The degree of correction $\alpha$ used in the algorithm has been chosen equal to 0.1 and the
Q-function has been initialized to zero everywhere at the beginning of the learning.
Although we believe that our approach to reinforcement learning is very
promising, there are still many open questions. In the formulation of our al- 
gorithm, we have not made any assumption about the way the four-tuples are 
generated. However, the quality of the induced control policy depends obviously 
on the sampling mechanism. So, an interesting future research direction is the 
determination for a given problem of the smallest possible (for computational 
efficiency reasons) sample of four-tuples that gives a near optimal control policy. 
This will raise the related question of how to interact at best with a system 
so as to generate a good sample of four-tuples. One very interesting property 
of our algorithm is that these questions are decoupled from the question of the 
determination of the optimal control policy from a given sample of four-tuples. 

Appendix: Precise Definition of the “Car on the Hill” Control Problem

Fig. 3. The “Car on the Hill” control problem



System dynamics: The system has a continuous-time dynamics described by
these two differential equations:

$$\dot{p} = s \tag{26}$$

$$\dot{s} = \frac{u}{m(1 + H'(p)^2)} - \frac{g H'(p)}{1 + H'(p)^2} - \frac{s^2 H'(p) H''(p)}{1 + H'(p)^2} \tag{27}$$



where $m$ and $g$ are parameters equal respectively to 1 and 9.81 and where $H(p)$
is a function of $p$ defined by the following expression:

$$H(p) = \begin{cases} p^2 + p & \text{if } p < 0 \\ \dfrac{p}{\sqrt{1 + 5p^2}} & \text{if } p \ge 0 \end{cases} \tag{28}$$



The discrete-time dynamics is obtained by discretizing the time with the time
between $t$ and $t+1$ chosen equal to 0.100 s. If $p_{t+1}$ and $s_{t+1}$ are such that
$|p_{t+1}| > 1$ or $|s_{t+1}| > 3$ then a terminal state
is reached.
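A minimal simulation sketch of these dynamics follows. Several points are assumptions rather than statements of the paper: the branch of $H$ for $p \ge 0$ is taken to be $p / \sqrt{1 + 5p^2}$ (the usual definition of this benchmark), the two subtracted terms in the acceleration follow the standard form of eqn. (27), and the explicit Euler scheme with 100 sub-steps per control period is chosen only for simplicity, since the integration method is not specified here.

```python
M, G = 1.0, 9.81  # parameters m and g given in the appendix


def hill_slope(p):
    """H'(p), assuming H(p) = p**2 + p for p < 0 and p / sqrt(1 + 5 p**2) otherwise."""
    return 2.0 * p + 1.0 if p < 0 else (1.0 + 5.0 * p * p) ** -1.5


def hill_curvature(p):
    """H''(p) under the same assumption on H."""
    return 2.0 if p < 0 else -15.0 * p * (1.0 + 5.0 * p * p) ** -2.5


def acceleration(p, s, u):
    """Right-hand side of the speed equation (eqn. (27))."""
    h1, h2 = hill_slope(p), hill_curvature(p)
    denom = 1.0 + h1 * h1
    return u / (M * denom) - G * h1 / denom - s * s * h1 * h2 / denom


def step(p, s, u, dt=0.1, substeps=100):
    """Advance one control period with explicit Euler sub-steps (scheme assumed)."""
    h = dt / substeps
    for _ in range(substeps):
        p, s = p + h * s, s + h * acceleration(p, s, u)
    terminal = abs(p) > 1.0 or abs(s) > 3.0
    return p, s, terminal


# One control period from the bottom of the valley with full acceleration.
p1, s1, done = step(-0.5, 0.0, 4.0)
```

At the bottom of the valley the slope $H'(-0.5)$ is zero, so the initial acceleration reduces to $u / m$.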

State space: The state space $X$ is composed of $\{(p, s) \in \mathbb{R}^2 \mid |p| \le 1 \text{ and } |s| \le 3\}$
and of a terminal state $x^t$. $X \setminus \{x^t\}$ is represented on Figure 3a.

Action space: The action space is $U = \{-4, 4\}$.

Reward function: The reward function r{x,u,w) is defined through the fol- 
lowing expression: 



Decay factor: The decay factor $\gamma$ has been chosen equal to 0.95. Notice that
in this particular problem the value of $\gamma$ actually does not influence the optimal
control policy.
Abstract. Receiver Operating Characteristic (ROC) analysis has been
successfully applied to classifier problems with two classes. The Area Under the ROC
Curve (AUC) has been elected as a better way to evaluate classifiers than
predictive accuracy or error and has also recently been used for evaluating probability
than two classes has not been addressed to date, because of the complexity and 
elusiveness of its precise definition. Some approximations to the real AUC are 
used without an exact appraisal of their quality. In this paper, we present the 
real extension to the Area Under the ROC Curve in the form of the Volume 
Under the ROC Surface (VUS), showing how to compute the polytope that cor- 
responds to the absence of classifiers (given only by the trivial classifiers), to 
the best classifier and to whatever set of classifiers. We compare the real VUS 
with “approximations” or “extensions” of the AUC for more than two classes. 



1 Introduction 

In general, classifiers are used to make predictions for decision support. Since predic- 
tions can be wrong, it is important to know what the effect is when the predictions are 
incorrect. In many situations not every error has the same consequences. Some errors 
have greater cost than others, especially in diagnosis. For instance, a wrong diagnosis 
or treatment can have different cost and dangers depending on which kind of mistake 
has been done. In fact, it is usually the case that misclassifications of minority classes 
into majority classes (e.g. predicting that a system is safe when it is not) have greater 
costs than misclassifications of majority classes into minority classes (e.g. predicting 
that a system is not safe when it actually is). Obviously, the costs of each misclassifi- 
cation are problem dependent, but it is almost never the case that they would be uni- 
form for a single problem. Consequently, accuracy is not generally the best way to 
evaluate the quality of a classifier or a learning algorithm. 

Cost-sensitive learning [14] is a more realistic generalisation of predictive learn- 
ing, and cost-sensitive models allow for a better and wiser decision making. The 
quality of a model is measured in terms of cost minimisation rather than error mini- 
misation. When cost matrices are provided a priori, i.e. before learning takes place, 
the matrices have to be fully exploited to obtain models that minimise cost.

However, in many circumstances, costs are not known a priori or the models are
just there to be evaluated or chosen. Receiver Operating Characteristic (ROC) analy- 
sis [5] [9] [13] has been proven to be very useful for evaluating given classifiers in 
these cases, when the cost matrix was not known at the moment the classifiers were 
constructed. ROC analysis provides tools to select a set of classifiers that will behave 
optimally and reject some other useless classifiers. In order to do this, the convex hull 
of all the classifiers is constructed, giving a “curve” (a convex polygon). 

In the simplest case, a single 2-class classifier forms a 4-segment ROC curve (a 
polygon in a strict sense) with the point given by the classifier, two trivial classifiers 
(the classifier that always predicts class 0 and the classifier that always predicts class 
1) and the origin, whose area can be computed. This area is called the Area Under the
ROC Curve (AUC) and has become a preferred alternative to accuracy (or error) for
evaluating classifiers. AUC is also used to evaluate probabilistic estimators in
settings where ranking predictions is important [10].
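For a single two-class classifier the area in question is that of a small polygon, so it can be computed directly. The sketch below uses the standard (FPR, TPR) convention and the shoelace formula; the function names are illustrative, and closing the polygon at the corner (1, 0) is an assumption about which region is meant.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def single_classifier_auc(fpr, tpr):
    """Area under the ROC 'curve' through (0,0), (fpr, tpr), (1,1), closed at (1,0)."""
    return polygon_area([(0.0, 0.0), (fpr, tpr), (1.0, 1.0), (1.0, 0.0)])


perfect = single_classifier_auc(0.0, 1.0)  # classifier with no errors
chance = single_classifier_auc(0.5, 0.5)   # point on the diagonal
typical = single_classifier_auc(0.2, 0.8)
```

For one classifier this polygon area simplifies to $(\mathrm{TPR} - \mathrm{FPR} + 1)/2$.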

ROC analysis and the AUC measure have been extensively used in the area of 
medical decision making [7][15], in the field of knowledge discovery, data mining, 
pattern recognition [1] and science in general [13]. However, the applicability of 
ROC analysis and the AUC has only been shown for problems with two classes. Al- 
though ROC analysis can be extended in theory for multi-dimensional problems [12], 
practical issues (computational complexity and representational comprehensibility, 
especially) preclude its use in practice. The major hindrance is the high
dimensionality. A confusion matrix obtained from a problem of $c$ classes has $c^2$ positions, and
$c \cdot (c-1)$ dimensions ($d$), i.e. all possible misclassification combinations are needed.

Nonetheless, although difficult, it is possible to perform ROC analysis for more 
than two classes and to compute the AUC (more precisely, the Volume Under the 
ROC Surface, VUS). However, the trivial classifiers for more than two classes, the 
minimum and maximum volume have not been identified to date in the literature. 

In this paper, we present the trivial classifiers, the equations, the maximum and 
minimum VUS, for classifiers of more than 2 classes. We use this to compute the real 
VUS for any classifier by the use of a Hyperpolyhedron Search Algorithm (HSA) 
[11]. We then compare experimentally the real VUS with several other (and new) 
approximations, showing which approximation is best. 



2 ROC Analysis 

The Receiver Operating Characteristic (ROC) analysis [5][9][13] allows the evalua- 
tion of classifier performance in a more independent and complete way than just 
using accuracy. ROC analysis has usually been presented for two classes, because it is 
easy to define, to interpret and it is computationally feasible. 

ROC analysis for two classes is based on plotting the true-positive rate (TPR) on
the y-axis and the false-positive rate (FPR) on the x-axis, giving a point for each
classifier. A “curve” is obtained because we can obtain infinitely many derived classifiers
along the segment that connects two classifiers just by voting them with different
weights. Hence, any point below that segment will have greater cost for any class
distribution and cost matrix, because it has lower TPR and/or higher FPR. According
to this, given several classifiers, one can discard the classifiers that fall under the
convex hull formed by the points representing the classifiers and the points (0,0) and
(1,1), which represent the default classifiers always predicting negative and positive,
respectively. A detailed description of ROC analysis can be found in [5] [9].

The usual way to represent the ROC space is not, in our opinion, a very coherent 
way, since the true class is represented incrementally for correct predictions and the 
false class is represented incrementally for incorrect predictions. Moreover this choice 
is not easily extensible for more than two classes. Instead, we propose to represent the 
false-negative-rate (FNR) and the FPR. Now, the points (0,1) and (1,0) represent, 
respectively, the classifier that classifies anything as negative and the classifier that 
classifies anything as positive. The curve is now computed with points (0,1), (1,0) 
and (1,1). 

Obviously, with this new diagram, instead of looking for the maximisation of the 
Area Under the ROC Curve (AUC) we have to look for its minimisation. A better 
option is to compute the Area Above the ROC Curve (AAC). In order to maintain 
accordance with classical terminology, we will refer to the AAC also as AUC. 



3 Multi-class ROC Analysis 

Srinivasan has shown in [12] that, theoretically, the ROC analysis extends to more
than two classes “directly”. For $c$ classes, and assuming a normalised cost matrix, we
have to construct a vector of $d = c(c-1)$ dimensions for each classifier. In general the
cost of a classifier for $c$ classes is:

Cost= 

where $R$ is the confusion ratio matrix (each column normalised to sum to 1), $C$ is the
cost matrix, and $p(i)$ is the absolute frequency of class $i$. From the previous formula,
two classifiers 1 and 2 will have the same cost when they are on the same
iso-performance hyperplane. However, the $d-1$ values of the hyperplane are not as
straightforward and easy to obtain and understand as the slope value of the
bi-dimensional case.

In the same way as the bi-dimensional case, the convex hull can be constructed, 
forming a polytope. To know whether a classifier can be rejected, it has to be seen 
whether the intersection of the current polytope with the polytope of the new classi- 
fier gives the new polytope, i.e., the new polytope is included in the first polytope [8]. 
Provided this direct theoretical extension, there are some problems. 

- In two dimensions, doubling the probability of one class has a direct counterpart in 
costs. This is not so for d>2, because there are many more degrees of freedom. 

- The best algorithm for the convex hull of $N$ points is $O(N \log N + N^{\lfloor d/2 \rfloor})$ [8] [3].

- In the 2-d case, it is relatively straightforward how to detect the trivial classifiers 
and the points for the minimum and maximum cases. 

However, there are not only computational limitations but also representational ones.
ROC analysis in two dimensions has a very nice and understandable representation,
but it cannot be directly extended to more than two classes, because even for 3 classes
we have a 6D space, which is quite difficult to represent. In what follows, we illustrate
the extension for 3 classes, although the expressions can be generalised easily.



3.1 Extending ROC Analysis for 3 Classes 

In this part we consider the extension of ROC analysis for 3-class problems. In this
context we consider the following cost ratio matrix for three-class classifiers:

                          Actual
                     a       b       c
               a    h_a     x_1     x_2
    Predicted  b    x_3     h_b     x_4
               c    x_5     x_6     h_c

This gives a 6-dimensional point $(x_1, x_2, x_3, x_4, x_5, x_6)$. The values $h_a$, $h_b$ and $h_c$ are
dependent and do not need to be represented, because:

$h_a + x_3 + x_5 = 1, \quad h_b + x_1 + x_6 = 1, \quad h_c + x_2 + x_4 = 1$

3.1.1 Maximum VUS for 3 Classes 

Let us begin by considering the maximum volume. The maximum volume should
represent the volume containing all the possible classifiers. A point in the 1-long
hypercube is a classifier if and only if:

$x_3 + x_5 \le 1, \quad x_1 + x_6 \le 1, \quad x_2 + x_4 \le 1$

It is easy to obtain the volume of the space determined by these equations, just by
using the probability that 6 random numbers under a uniform distribution U(0,1)
would follow the previous conditions. More precisely:

$VUS_3^{max} = P(U(0,1) + U(0,1) \le 1) \cdot P(U(0,1) + U(0,1) \le 1) \cdot P(U(0,1) + U(0,1) \le 1)$

$= [P(U(0,1) + U(0,1) \le 1)]^3$

It is easy to see that the probability that the sum of two random numbers under the
distribution U(0,1) is less than 1 is exactly 1/2, i.e.:

$P(U(0,1) + U(0,1) \le 1) = 1/2$, consequently $VUS_3^{max} = (1/2)^3 = 1/8$.

We have also considered the maximum VUS for $c$ classes. It is easy to see that the
volume of the space determined by valid equations for $c$ classes is:

$VUS_c^{max} = \prod_{i=1}^{c} \big[P\big(\sum_{j=1}^{c-1} U(0,1) \le 1\big)\big] = \big[P\big(\sum_{j=1}^{c-1} U(0,1) \le 1\big)\big]^c.$

However, the probability that the sum of $c-1$ random numbers under the distribution
U(0,1) is less than 1 is difficult to obtain. In particular, the probability density
function of the sum of $n$ uniform variables on the interval [0,1] can be obtained using
the characteristic function of the uniform distribution:

$$d_n(x) = F^{-1}\left[\left(\frac{i - i\cos t + \sin t}{t}\right)^{n}\right] = \frac{1}{2(n-1)!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} (x-k)^{n-1} \, \mathrm{sgn}(x-k)$$



Using the cumulative distribution function $D_n(x)$, we have that the probability that the
sum of $n$ random numbers with U(0,1) is less than 1 is:

$$P\Big(\sum_{j=1}^{n} U(0,1) \le 1\Big) = D_n(1)$$

For $n = 1$ we have $D_1(1) = 1$, for $n = 2$ we have $D_2(1) = 1/2$, for $n = 3$, $D_3(1) = 1/6$ and:

$$VUS_c^{max} = (D_{c-1}(1))^c$$

And hence we have $VUS_2^{max} = 1$, $VUS_3^{max} = 1/8$ and $VUS_4^{max} = 1/1296$. However, for
$n > 3$, $D_n$ is complex. For such cases, we can approximate the sum of $n$ random numbers
under the distribution U(0,1) with a single variable ($Y$) under the normal distribution
with $\mu = n/2$ and $\sigma^2 = n/12$ using the central limit theorem. Then:

$$P(Y \le 1) = P\left(Z \le \frac{1 - n/2}{\sqrt{n/12}}\right)$$

where $Z$ is a standard normal distribution variable. Therefore, when $c > 3$:

$$VUS_c^{max} \approx \left[P\left(Z \le \frac{1 - (c-1)/2}{\sqrt{(c-1)/12}}\right)\right]^c$$
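The values quoted for $D_n(1)$ and for the maximum VUS can be checked numerically. The sketch below relies on a standard fact that is not stated in the text, namely that the probability that $n$ independent U(0,1) variables sum to at most 1 is the volume of the unit simplex, $1/n!$; the normal approximation uses mean $n/2$ and variance $n/12$ as described above. Function names are illustrative.

```python
from fractions import Fraction
from math import erf, factorial, sqrt


def exact_sum_below_one(n):
    """P(U_1 + ... + U_n <= 1) for independent U(0,1): the unit simplex volume 1/n!."""
    return Fraction(1, factorial(n))


def normal_sum_below_one(n):
    """Central-limit approximation of the same probability (mean n/2, variance n/12)."""
    z = (1.0 - n / 2.0) / sqrt(n / 12.0)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z


def vus_max(c, prob=exact_sum_below_one):
    """Maximum VUS for c classes: (D_{c-1}(1)) ** c."""
    return prob(c - 1) ** c


exact_values = {c: vus_max(c) for c in (2, 3, 4)}
approx_five_classes = vus_max(5, prob=normal_sum_below_one)
```

Passing `normal_sum_below_one` as `prob` gives the approximate route for larger $c$; since the approximation targets a far tail of the distribution, its accuracy should be checked before relying on it.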
3.1.2 Minimum VUS for 3 Classes

Now let us try to derive the minimum VUS. Without any knowledge we can construct
trivial classifiers by giving more or less probability to each class, as follows:

                          Actual
                     a       b       c
               a    h_a     h_a     h_a
    Predicted  b    h_b     h_b     h_b
               c    h_c     h_c     h_c

where $h_a + h_b + h_c = 1$. These obviously include the three extreme trivial classifiers
“everything is a”, “everything is b” and “everything is c”. Given a classifier:

                          Actual
                     a       b       c
               a    z_a     v_1     v_2
    Predicted  b    v_3     z_b     v_4
               c    v_5     v_6     z_c

we can discard this classifier if and only if it is above a trivial classifier, formally:
$\exists\, h_a, h_b, h_c \in \mathbb{R}^+$ where ($h_a + h_b + h_c = 1$) such that:

$v_1 \ge h_a, \; v_2 \ge h_a, \; v_3 \ge h_b, \; v_4 \ge h_b, \; v_5 \ge h_c, \; v_6 \ge h_c$

From here, we can derive the following theorem (see [4] for the proof):

Theorem 1: Without any knowledge, a classifier $(x_1, x_2, x_3, x_4, x_5, x_6)$ can be discarded
iff $r_1 + r_2 + r_3 \ge 1$, where $r_1 = \min(x_1, x_2)$, $r_2 = \min(x_3, x_4)$ and $r_3 = \min(x_5, x_6)$.

Given the previous property, we only have to compute the space of classifiers that
follow the condition $r_1 + r_2 + r_3 \ge 1$, where $r_1 = \min(x_1, x_2)$, $r_2 = \min(x_3, x_4)$ and
$r_3 = \min(x_5, x_6)$, to obtain the minimum volume corresponding to total absence of
information. More precisely, we have to compute the volume formed by this condition jointly
with the valid classifier conditions, i.e.:

$x_3 + x_5 \le 1, \quad x_1 + x_6 \le 1, \quad x_2 + x_4 \le 1$ and $r_1 + r_2 + r_3 \ge 1$
where $r_1 = \min(x_1, x_2)$, $r_2 = \min(x_3, x_4)$ and $r_3 = \min(x_5, x_6)$

This volume is more difficult to obtain by a probability estimation, due to the
min function and especially because the first conditions and the last one are
dependent. Let us compute this volume using a Monte Carlo method.

3.1.3 Monte Carlo Method for Obtaining Max and Min VUS 

Monte Carlo methods are used to randomly generate a subset of cases from a problem 
space and estimate the probability that a random case follows a set of conditions. 
These methods are particularly interesting to approximate volumes, such as the vol- 
ume under the ROC curve we are dealing with. 

For this purpose, we generate an increasing number of points in the 6D hypercube
of length 1 (i.e., we generate six variables $x_1, x_2, x_3, x_4, x_5, x_6$ using a uniform
distribution between 0 and 1) and then check whether or not they follow the previous
maximum or minimum conditions. Since we are working with a 1-length hypercube, the
proportion of cases following the conditions is an estimate of the volume we are looking for.
In particular, we have obtained the following results:

- Maximum: 0.12483 for 1,000,000 cases, matching our theoretical $VUS_3^{max} = 1/8$.

- Minimum: 0.00555523 for 1,000,000 cases, which is approximately 1/180.

However, although we have obtained the exact maximum, we have not obtained the
exact minimum (although 1/180 is conjectured). In the next section we introduce a
method to compute the real $VUS^{min}$ and, more importantly, to obtain the ROC
polytopes that form these volumes.
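
The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed seed and the sample size are assumptions, and the conditions are the valid classifier conditions and the Theorem 1 discard condition as stated in Section 3.1.2.

```python
import random


def is_valid(x):
    """Valid classifier conditions for the 6-D point x = (x1, ..., x6)."""
    x1, x2, x3, x4, x5, x6 = x
    return x3 + x5 <= 1 and x1 + x6 <= 1 and x2 + x4 <= 1


def is_above_trivial(x):
    """Discard condition of Theorem 1: r1 + r2 + r3 >= 1."""
    x1, x2, x3, x4, x5, x6 = x
    return min(x1, x2) + min(x3, x4) + min(x5, x6) >= 1


def estimate_volumes(n_points, seed=0):
    """Return Monte Carlo estimates (max_vus, min_vus) from n_points samples."""
    rng = random.Random(seed)
    n_max = n_min = 0
    for _ in range(n_points):
        x = [rng.random() for _ in range(6)]  # uniform point in the unit 6-cube
        if is_valid(x):
            n_max += 1
            if is_above_trivial(x):
                n_min += 1
    return n_max / n_points, n_min / n_points


if __name__ == "__main__":
    max_vus, min_vus = estimate_volumes(200_000)
    print(f"maximum VUS estimate: {max_vus:.5f} (theory: {1 / 8:.5f})")
    print(f"minimum VUS estimate: {min_vus:.5f} (conjectured: {1 / 180:.5f})")
```

Because the estimates are sample proportions, their accuracy improves only with the square root of the number of points, which is why the minimum (a much rarer event) is harder to pin down than the maximum.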



4 A Constraint Satisfaction Algorithm for the ROC Polytopes 

In the previous section we have developed the conditions for the maximum and mini- 
mum VUS, given, respectively, when the best classifier is known (0, 0, 0, 0, 0, 0) and 
when no classifier is given (absence of information). However, we are interested in a 
way to obtain the border points of each space, i.e., the polytopes that represent both 
cases. What we need is a way to compute these polytopes given the set of conditions. 
A general system able to do this is HSA. 



4.1 Hyperpolyhedron Search Algorithm (HSA) 

In the constraint satisfaction literature, researchers have focussed on discrete and
binary Constraint Satisfaction Problems (CSPs). However, many real problems (such as
the ROC surface problem) can be naturally modelled using non-binary constraints
over continuous variables. The Hyperpolyhedron Search Algorithm (HSA) [11] is a CSP
solver that manages non-binary and continuous problems. HSA carries out the search
through a hyperpolyhedron that maintains in its vertices those solutions that satisfy all
non-binary constraints. The handling of the non-binary constraints (linear inequations)
can be seen as the handling of global hyperpolyhedron constraints. Initially, the
hyperpolyhedron is created by the Cartesian product of the variable domain bounds.
For each constraint, HSA checks the consistency, updating the hyperpolyhedron
through linear programming techniques. Each constraint is a hyperplane that is
intersected to obtain the new hyperpolyhedron vertices. The resulting hyperpolyhedron is
a convex set of solutions to the CSP. A solution is an assignment of a value from its
domain to every variable such that all constraints are satisfied. HSA can determine
whether a solution exists (consistency), return several solutions, or return the extreme solutions.

In the ROC surface problem, we will use HSA to determine the extreme solutions 
in order to calculate the convex hull of the resulting hyperpolyhedron. HSA does not 
compute the volume of the hyperpolyhedron. For this purpose, we are using QHull 
[2]. QHull is, among other things, an algorithm that implements a quick method for 
computing the convex hull of a set of points and the volume of the hull. 

4.2 Maximum VUS Points for 3 Classes 

Let us recover the equations for the maximum volume (valid classifier conditions): 

$$x_3 + x_5 \le 1,\quad x_1 + x_6 \le 1,\quad x_2 + x_4 \le 1$$

We introduce these equations to HSA and look for solutions for these six variables. 
We obtain 41 points (which can be simplified into just 27 points, see [4]) whose vol- 
ume is, as expected, 0.125 (1/8). 

4.3 Minimum VUS for 3 Classes 

From Theorem 1, in order to compute the minimum VUS, we only have to compute
the space of classifiers satisfying $r_1 + r_2 + r_3 \ge 1$, where $r_1 = \min(x_1, x_2)$,
$r_2 = \min(x_3, x_4)$ and $r_3 = \min(x_5, x_6)$, to obtain the minimum volume corresponding
to total absence of information. Using this condition and the hyper-cube conditions, we have:

$$x_3 + x_5 \le 1,\quad x_1 + x_6 \le 1,\quad x_2 + x_4 \le 1,\quad r_1 + r_2 + r_3 \ge 1$$

where $r_1 = \min(x_1, x_2)$, $r_2 = \min(x_3, x_4)$ and $r_3 = \min(x_5, x_6)$. Since the min function is
not handled by HSA, we convert the last condition into eight equivalent inequations:

$$x_1 + x_3 + x_5 \ge 1,\quad x_1 + x_3 + x_6 \ge 1,\quad x_1 + x_4 + x_5 \ge 1,\quad x_1 + x_4 + x_6 \ge 1,$$

$$x_2 + x_3 + x_5 \ge 1,\quad x_2 + x_3 + x_6 \ge 1,\quad x_2 + x_4 + x_5 \ge 1,\quad x_2 + x_4 + x_6 \ge 1$$

and now we obtain a set of 25 points whose volume is 0.0055, which is approximately
1/180 and matches the volume obtained by the Monte Carlo method.

Some of these points are exactly on the surface of the volume and can be removed
without modifying the volume, resulting in a simplified set of 9 points (see [4]).
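
The conversion of the min condition into eight linear inequations can be checked numerically. The sketch below (names and sampling scheme are illustrative assumptions) enumerates one inequation per way of picking a single variable from each of the pairs (x1, x2), (x3, x4), (x5, x6), and compares the result with the original min-based condition on random points.

```python
import itertools
import random

# Zero-based indices of the pairs (x1, x2), (x3, x4), (x5, x6).
PAIRS = ((0, 1), (2, 3), (4, 5))


def min_condition(x):
    """r1 + r2 + r3 >= 1, with each r the minimum of one pair."""
    return sum(min(x[i], x[j]) for i, j in PAIRS) >= 1


def eight_inequalities(x):
    """All 2^3 = 8 sums with one variable taken from each pair are >= 1."""
    return all(x[i] + x[j] + x[k] >= 1
               for i, j, k in itertools.product(*PAIRS))


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[rng.random() for _ in range(6)] for _ in range(10_000)]
    agree = all(min_condition(x) == eight_inequalities(x) for x in points)
    print("conditions agree on all sampled points:", agree)
```

The equivalence holds because the smallest of the eight sums is exactly the sum of the three minima, so requiring every sum to reach 1 is the same as requiring the smallest one to do so.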

4.4 Computing the VUS of Any Classifier 

Now it seems that we can obtain the VUS of any classifier just by taking the coordinates
of the point it represents, adding them as a new point to the minimum and
then computing the convex hull. However, this would be a hasty step. The surprise
comes if we take the minimum (9 points, 1/180) and add the origin (the best
classifier with no error at all). In this case, we obtain 10 points and a volume of 1/120,
which is greater but is not the maximum. This seems contradictory, because
if we have the best classifier, we should obtain the maximum volume. The
reason is the following. When we have the perfect classifier, represented by the point
(0, 0, 0, 0, 0, 0), any classifier that has a value equal to or greater than 0 in any
coordinate is discardable and, logically, this should give 1/8, not 1/120. The issue is that
whenever we add a new classifier we have to consider the conditions it produces,
which are polytopes, not just points.

In other words, the perfect classifier generates the following discard equations:

$$x_1 \ge 0,\; x_2 \ge 0,\; x_3 \ge 0,\; x_4 \ge 0,\; x_5 \ge 0,\; x_6 \ge 0$$

These inequations are null, because all the values should be positive, and, hence, we
only have the valid classifier conditions, and then we have the maximum volume 1/8.
Now let us consider the same thing for any arbitrary classifier C1:

                      Actual
                 a      b      c
             a   z_aa   z_ba   z_ca
 Predicted   b   z_ab   z_bb   z_cb
             c   z_ac   z_bc   z_cc

What can be discarded? The answer is any classifier that is worse than the
classifier C1 (combined with the trivial classifiers), i.e., any classifier that would have
greater values for the 6 dimensions. Consequently, given a new classifier C2:

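                      Actual
                 a      b      c
             a   v_aa   v_ba   v_ca
 Predicted   b   v_ab   v_bb   v_cb
             c   v_ac   v_bc   v_cc

We have to look at all the classifiers constructed as a linear combination of the three
trivial classifiers and the classifier C1, and see whether C2 is worse than any of the
constructed classifiers. Formally, the linear combination is defined as:

$$h_a \cdot (1, 1, 0, 0, 0, 0) + h_b \cdot (0, 0, 1, 1, 0, 0) + h_c \cdot (0, 0, 0, 0, 1, 1) + h_1 \cdot (z_{ba}, z_{ca}, z_{ab}, z_{cb}, z_{ac}, z_{bc})$$

And we can discard C2 when

$\exists h_a, h_b, h_c, h_1 \in \mathbb{R}^+$ where ($h_a + h_b + h_c + h_1 = 1$) such that:

$$\begin{aligned}
v_{ba} &\ge h_a + 0 + 0 + h_1 \cdot z_{ba}, & v_{ca} &\ge h_a + 0 + 0 + h_1 \cdot z_{ca}, \\
v_{ab} &\ge 0 + h_b + 0 + h_1 \cdot z_{ab}, & v_{cb} &\ge 0 + h_b + 0 + h_1 \cdot z_{cb}, \\
v_{ac} &\ge 0 + 0 + h_c + h_1 \cdot z_{ac}, & v_{bc} &\ge 0 + 0 + h_c + h_1 \cdot z_{bc}
\end{aligned}$$

This gives a system of inequations with 10 variables (the $z_{ij}$ are constants given by C1)
that can be input to HSA, and then we obtain the edge points in six dimensions from the $v_{ij}$.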
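
For a single pair of classifiers, the discard test above is a small feasibility problem that can also be checked directly, without a general solver. The following Python sketch is not HSA and is not the authors' method: it is a hand-written check that assumes both classifiers are given as non-negative 6-tuples in the coordinate order (ba, ca, ab, cb, ac, bc), and it exploits the fact that, for a fixed weight h_1 on C1, the remaining weights are feasible exactly when the three pairwise minima leave enough room for a total mass of 1 - h_1. All names are hypothetical.

```python
# Zero-based coordinate pairs that share the weights h_a, h_b and h_c.
PAIRS = ((0, 1), (2, 3), (4, 5))


def discardable_given(v, z, eps=1e-12):
    """True if some combination of the trivial classifiers and C1 = z is
    at least as good as C2 = v in all six coordinates."""
    # Largest admissible weight h_1 on C1: v_k - h_1 * z_k must stay >= 0.
    t_max = 1.0
    for vk, zk in zip(v, z):
        if zk > 0:
            t_max = min(t_max, vk / zk)

    # The slack below is concave and piecewise linear in h_1, so it is
    # enough to test the interval ends and the kinks of each min(...).
    candidates = {0.0, t_max}
    for i, j in PAIRS:
        if z[i] != z[j]:
            t = (v[i] - v[j]) / (z[i] - z[j])
            if 0.0 < t < t_max:
                candidates.add(t)

    def slack(h1):
        # Room available for h_a + h_b + h_c minus the mass 1 - h_1 they carry.
        room = sum(min(v[i] - h1 * z[i], v[j] - h1 * z[j]) for i, j in PAIRS)
        return room - (1.0 - h1)

    return any(slack(t) >= -eps for t in candidates)


if __name__ == "__main__":
    perfect = (0.0,) * 6
    print(discardable_given((0.3, 0.2, 0.1, 0.4, 0.0, 0.5), perfect))
    print(discardable_given((0.1,) * 6, (0.2,) * 6))
```

With the perfect classifier as C1 every valid point is discardable, which reproduces the observation that adding the origin must yield the maximum volume rather than the hull of the minimum polytope plus one point.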

4.5 Real VUS for More than One Classifier 

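In the same way as before, given a set of classifiers, we can compute the true VUS of
the set, just generalising the previous formula. Let us illustrate it for 4 classifiers Z,
W, Y and X. In fact, what we have to do is to consider the linear combination of the
three trivial classifiers with the four given classifiers, i.e.:

$$\begin{aligned}
& h_a \cdot (1, 1, 0, 0, 0, 0) + h_b \cdot (0, 0, 1, 1, 0, 0) + h_c \cdot (0, 0, 0, 0, 1, 1) \\
& + h_1 \cdot (z_{ba}, z_{ca}, z_{ab}, z_{cb}, z_{ac}, z_{bc}) + h_2 \cdot (w_{ba}, w_{ca}, w_{ab}, w_{cb}, w_{ac}, w_{bc}) \\
& + h_3 \cdot (x_{ba}, x_{ca}, x_{ab}, x_{cb}, x_{ac}, x_{bc}) + h_4 \cdot (y_{ba}, y_{ca}, y_{ab}, y_{cb}, y_{ac}, y_{bc})
\end{aligned}$$

And now we can discard when:

$\exists h_a, h_b, h_c, h_1, h_2, h_3, h_4 \in \mathbb{R}^+$ where ($h_a + h_b + h_c + h_1 + h_2 + h_3 + h_4 = 1$) such that:

$$\begin{aligned}
v_{ba} &\ge h_a + 0 + 0 + h_1 \cdot z_{ba} + h_2 \cdot w_{ba} + h_3 \cdot x_{ba} + h_4 \cdot y_{ba} \\
v_{ca} &\ge h_a + 0 + 0 + h_1 \cdot z_{ca} + h_2 \cdot w_{ca} + h_3 \cdot x_{ca} + h_4 \cdot y_{ca} \\
v_{ab} &\ge 0 + h_b + 0 + h_1 \cdot z_{ab} + h_2 \cdot w_{ab} + h_3 \cdot x_{ab} + h_4 \cdot y_{ab} \\
v_{cb} &\ge 0 + h_b + 0 + h_1 \cdot z_{cb} + h_2 \cdot w_{cb} + h_3 \cdot x_{cb} + h_4 \cdot y_{cb} \\
v_{ac} &\ge 0 + 0 + h_c + h_1 \cdot z_{ac} + h_2 \cdot w_{ac} + h_3 \cdot x_{ac} + h_4 \cdot y_{ac} \\
v_{bc} &\ge 0 + 0 + h_c + h_1 \cdot z_{bc} + h_2 \cdot w_{bc} + h_3 \cdot x_{bc} + h_4 \cdot y_{bc}
\end{aligned}$$

This gives a system with 9+4 variables that can be solved by HSA, from which we
again retain just the 6 variables $v_{ij}$ to obtain the polytope.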



5 Evaluation of Multi-class Approximations to the VUS 

In the previous section we have developed a method (conditions + HSA) to obtain the
real VUS of any classifier for an arbitrary number of classes (the extension for more
than 3 classes is trivial). However, this exact computation, although quite efficient for
3 and 4 classes, is likely to be impractical for a higher number of classes or classifiers.

In the literature, there have been several approximations for the extension of the
AUC measure for multi-class problems, either based on the interpretation of the AUC
as distribution separability [6] or the meaning of the equivalent (for two classes)
Wilcoxon statistic or GINI coefficient. However, there is no appraisal or estimation,
either theoretical or practical, of how good they are.

In this section we gather and recall the approximations for the AUC for more
than two classes known to date: macro-average, 1-point trivial AUC extension and
some Hand & Till [6] variants. We are going to compare them with the real
measure we have presented in this work: the exact VUS (through the HSA method).

We will give the definitions for three classes, although they can be easily extended
to more classes. For the following definitions consider a classifier C2 as before.



5.1 Macro-average 

The macro-average is just defined as the average of the class accuracies, i.e.:

$$MAVG_3 = (v_{aa} + v_{bb} + v_{cc}) / 3$$

This measure has been used as a very simple way to handle unbalanced datasets
more appropriately (without using ROC analysis).



5.2 Macro-average Modified 

We modify the original definition of macro-average because it does not consider
the standard deviation between the points. For instance, using two classes, the point
(0.2, 0.2) has greater AUC than the point (0.1, 0.3) although both points have identical
macro-average. Thereby, we will employ the generalised mean instead of the average:

$$MAVG_3\text{-}MOD = \left(\frac{v_{aa}^{\,t} + v_{bb}^{\,t} + v_{cc}^{\,t}}{3}\right)^{1/t}$$

The best value for $t$ between the arithmetic mean ($t = 1$) and the geometric mean ($t \to 0$)
has been estimated experimentally. The value $t = 0.76$ obtains the best performance.
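
To make the role of the exponent concrete, here is a minimal Python sketch of a generalised (power) mean applied to the three class accuracies. The function names and the explicit handling of the limit t = 0 are assumptions of this sketch; only the value t = 0.76 comes from the text.

```python
import math


def generalised_mean(values, t):
    """Power mean with exponent t; t = 0 is treated as the geometric mean."""
    n = len(values)
    if abs(t) < 1e-12:
        if min(values) <= 0:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** t for v in values) / n) ** (1.0 / t)


def mavg3_mod(v_aa, v_bb, v_cc, t=0.76):
    """Modified macro-average of the three class accuracies."""
    return generalised_mean((v_aa, v_bb, v_cc), t)


if __name__ == "__main__":
    accuracies = (0.2, 0.6, 0.7)
    print("arithmetic mean:", generalised_mean(accuracies, 1.0))
    print("modified (t=0.76):", mavg3_mod(*accuracies))
    print("geometric mean:", generalised_mean(accuracies, 0.0))
```

For unequal class accuracies the modified value falls strictly between the geometric and the arithmetic mean, so unbalanced performance across classes is penalised more than by the plain macro-average.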



5.3 1-Point Trivial AUC Extension 

Going back to two classes, the area for one point $(v_{ab}, v_{ba})$ (in our representation) is:

$$AUC_2 = \max(1/2,\; 1 - v_{ab}/2 - v_{ba}/2)$$

Extending trivially the previous formula, we have this extension for 1-point:

$$AUC\text{-}1PT_3 = \max(1/3,\; 1 - (v_{ba} + v_{ca} + v_{ab} + v_{cb} + v_{ac} + v_{bc}) / 3)$$

This extension turns out to be equal to the macro-average since the columns of the matrix
sum to 1. The only difference is that the $1PT_3$ measure is never lower than 1/3.



5.4 1-Point Hand and Till Extension 



Hand and Till have presented a generalisation of the AUC measure [6] for soft classi- 
fiers, i.e., classifiers that assign a different score, reliability or probability with each 
prediction. Although we will deal with soft classifiers later, let us adapt Hand and 
Till’s formulation for crisp classifiers, i.e., classifiers that predict one of the possible 
classes, without giving any additional information about the reliability or probability 
of the predicted class, or the other classes. 

Hand and Till’s extension for more than two classes is based on the idea that if we
can compute the AUC for two classes $i, j$ (let us denote this by $A(i,j)$) then we can
compute an extension of AUC for any arbitrary number of classes by choosing all the
possible pairs (1 vs. 1). Since $A(i,j) = A(j,i)$, this can be simplified as shown in the
following Hand and Till’s M function:

$$M = \frac{1}{c(c-1)} \sum_{i \ne j} A(i,j) = \frac{2}{c(c-1)} \sum_{i < j} A(i,j)$$

Pursuing this idea we are going to introduce three variants. The first variant is given if
we consider the macro-average extension. Then we have:

$$HT1a = \left(\max(1/2, (v_{aa} + v_{bb})/2) + \max(1/2, (v_{aa} + v_{cc})/2) + \max(1/2, (v_{bb} + v_{cc})/2)\right) / 3$$

This is equal to the 1PT. But if we take failures into account instead of hits, we have:

$$HT1b = \left(\max(1/2, 1 - (v_{ab} + v_{ba})/2) + \max(1/2, 1 - (v_{ac} + v_{ca})/2) + \max(1/2, 1 - (v_{bc} + v_{cb})/2)\right) / 3$$

This measure is slightly different from the previous ones and we will use this one.
Another different way is normalisation, e.g., if we normalise only for classes a and b:

                                  Actual
                      a                            b
 Predicted   a   v_aa / (v_aa + v_ab)         x = v_ba / (v_ba + v_bb)
             b   y = v_ab / (v_aa + v_ab)     v_bb / (v_ba + v_bb)

We have $\max(1/2, 1 - (x + y)/2)$ and the same for the rest of combinations. Namely:

$$\begin{aligned}
HT2 = \Big( & \max\big(1/2,\; 1 - (v_{ba}/(v_{ba} + v_{bb}) + v_{ab}/(v_{aa} + v_{ab}))/2\big) \\
{} + {} & \max\big(1/2,\; 1 - (v_{ca}/(v_{ca} + v_{cc}) + v_{ac}/(v_{aa} + v_{ac}))/2\big) \\
{} + {} & \max\big(1/2,\; 1 - (v_{cb}/(v_{cb} + v_{cc}) + v_{bc}/(v_{bb} + v_{bc}))/2\big) \Big) / 3
\end{aligned}$$

Finally, we can define a third variant that, instead of computing partial AUCs of pairs
of classes, computes the AUC of each class against the rest (1 vs. rest) and then
averages the results. For instance, the AUC of class a and the rest (b and c joined) will be
obtained from a condensed 2x2 matrix:

                                            Actual
                      a                                    rest
 Predicted   a      v_aa / (v_aa + v_ab + v_ac)            (v_ba + v_ca) / (v_ba + v_ca + v_bb + v_bc + v_cb + v_cc)
             rest   (v_ab + v_ac) / (v_aa + v_ab + v_ac)   (v_bb + v_bc + v_cb + v_cc) / (v_ba + v_ca + v_bb + v_bc + v_cb + v_cc)

Using cells (a,rest) and (rest,a) we have:

$$AUC_{a,rest} = \max\left(1/2,\; 1 - \frac{v_{ab} + v_{ac}}{v_{aa} + v_{ab} + v_{ac}} \Big/ 2 - \frac{v_{ba} + v_{ca}}{v_{ba} + v_{ca} + v_{bb} + v_{bc} + v_{cb} + v_{cc}} \Big/ 2\right)$$

In the same way we can obtain $AUC_{b,rest}$ and $AUC_{c,rest}$. This allows us to define HT3:

$$HT3 = (AUC_{a,rest} + AUC_{b,rest} + AUC_{c,rest}) / 3$$
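
The approximations of this section can be summarised in a short Python sketch for three classes. It is an illustration under stated assumptions rather than the authors' code: the confusion matrix is passed as a nested list `v` in which `v[i][j]` is the proportion of examples of actual class i predicted as class j (so each inner list sums to 1), all denominators are assumed to be non-zero, and the function names are hypothetical.

```python
from itertools import combinations

CLASSES = (0, 1, 2)  # stand for classes a, b, c


def macro_average(v):
    return sum(v[i][i] for i in CLASSES) / 3


def one_point_trivial(v):
    errors = sum(v[i][j] for i in CLASSES for j in CLASSES if i != j)
    return max(1 / 3, 1 - errors / 3)


def ht1b(v):
    # 1 vs. 1, using the two error cells of each pair of classes.
    terms = [max(0.5, 1 - (v[i][j] + v[j][i]) / 2)
             for i, j in combinations(CLASSES, 2)]
    return sum(terms) / 3


def ht2(v):
    # 1 vs. 1 after renormalising each pair of classes to a 2x2 matrix.
    terms = []
    for i, j in combinations(CLASSES, 2):
        x = v[j][i] / (v[j][i] + v[j][j])  # actual j predicted as i
        y = v[i][j] / (v[i][i] + v[i][j])  # actual i predicted as j
        terms.append(max(0.5, 1 - (x + y) / 2))
    return sum(terms) / 3


def ht3(v):
    # 1 vs. rest, using the condensed 2x2 matrix of each class.
    terms = []
    for k in CLASSES:
        rest = [i for i in CLASSES if i != k]
        missed = sum(v[k][j] for j in rest) / sum(v[k])
        false_alarms = (sum(v[i][k] for i in rest)
                        / sum(sum(v[i]) for i in rest))
        terms.append(max(0.5, 1 - missed / 2 - false_alarms / 2))
    return sum(terms) / 3


if __name__ == "__main__":
    v = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
    for measure in (macro_average, one_point_trivial, ht1b, ht2, ht3):
        print(measure.__name__, round(measure(v), 4))
```

On any matrix whose class distributions sum to 1, `macro_average` and `one_point_trivial` coincide whenever the value is at least 1/3, which mirrors the remark made for the 1-point trivial extension.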



5.5 Experimental Evaluation 

Now that the previous approximations have been presented, we are ready to evaluate
them in comparison to the exact computation given by the HSA method.

We are interested in how well the approximations “rank” the classifiers. To evaluate
which approximation is best, we generate a set of 100 random classifiers (more
precisely, we randomly generate 100 normalised confusion matrices).

Then, we compute the value of each classifier for each approximation a (exact
VUS, accuracy, macro-avg, mod-avg, 1-p trivial, HT1B, HT2, HT3). Next, we make
a one-to-one comparison (a ranking) for each approximation a and fill a matrix $M^a$,
which tells whether i is ranked above j. Once all this is done (for a detailed description
of the methodology of this process, see [4]), we are ready to compare approximations.

For instance, given the matrices $M^1$ and $M^2$ of two different methods, we compare
the discrepancy of the matrices in the following way:

2E 

disc = 

n(n — 1) 
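
A minimal Python sketch of one way such a discrepancy could be computed is given below. It rests on an assumption that should be stated loudly: the discrepancy is read here as the fraction of the n(n-1)/2 pairs of classifiers that the two methods order differently. The function names and the representation of a ranking matrix as nested lists of booleans are likewise hypothetical.

```python
def ranking_matrix(scores):
    """m[i][j] is True when classifier i is ranked above classifier j."""
    n = len(scores)
    return [[scores[i] > scores[j] for j in range(n)] for i in range(n)]


def discrepancy(m1, m2):
    """Fraction of classifier pairs ordered differently by m1 and m2.

    Assumed reading of the measure: count disagreeing pairs (i < j) and
    normalise by the number of pairs, n(n-1)/2.
    """
    n = len(m1)
    disagreements = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (m1[i][j], m1[j][i]) != (m2[i][j], m2[j][i])
    )
    return 2 * disagreements / (n * (n - 1))


if __name__ == "__main__":
    exact = ranking_matrix([0.30, 0.20, 0.10])
    approx = ranking_matrix([0.90, 0.50, 0.70])
    print(discrepancy(exact, approx))
```

Under this reading, a value of 0 means that the approximation ranks every pair of classifiers exactly as the reference measure does, and lower values indicate a better approximation.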

With this formula we can evaluate the discrepancy of the methods for 3-class problems
with respect to the real VUS computed with the HSA method. The results are:

 Accuracy   Macro-avg   Mod-avg (0.76)   1-p trivial   HT1B      HT2       HT3
 0.08707    0.087071    0.0587879        0.09131       0.10404   0.14081   0.09677



According to these results, the best approximation is the modified macro-average
(generalised mean). Note that this is the only one that is better than accuracy. Note
also that for 2 classes, AUC = geomean(TPR, TNR), while for 3 classes, the best
result is obtained somewhere in the middle between the arithmetic mean and the
geometric mean. This modified mean obtains the lowest discrepancy among the studied
approximations and hence could be used as an alternative to accuracy and macro-avg.



6 Conclusions

In this paper we have addressed the extension of ROC analysis for multi-class prob- 
lems. We have identified the trivial classifiers and then derived the discard condi- 
tions, identified the maximum and minimum VUS and their polytopes, as well as the 
VUS for any arbitrary set of crisp classifiers. This is computed through the HSA 
algorithm. We have then compared experimentally the real VUS with several other 
approximations for crisp classifiers, showing which approximation is best. The best 
approximation seems to be a modification of the macro-average for one classifier. 

For soft classifiers (i.e., classifiers that accompany each prediction with the reli- 
ability or, even better, with the estimated probabilities of each class), we have per- 
formed some preliminary experiments (see [4]) which show that the best approxima- 
tion for soft classifiers is HT3. It is precisely for soft classifiers where the results can 
be more directly applicable to real-world problems. 

For the moment, the results of this work discourage the use of Hand and Till’s and
related measures as an extension of AUC for more than two classes for one crisp
classifier. We propose an alternative approximation (mod-average). Nonetheless, for
the case of soft classifiers the preliminary results in [4] are good for Hand and Till’s
extension (1 vs. 1, i.e. HT2) but especially for Fawcett’s extension (1 vs. rest, i.e.
HT3) already used in [9][10] for sets of classifiers or soft classifiers. Pursuing the
work initiated here will bring a more justified use of AUC extensions as evaluation
measures for classifiers. As future work, we would like to work further on soft
classifiers, deriving accurate approximations of the real VUS in a reasonable time.



Abstract. In this work we investigate several issues in order to improve the
performance of probabilistic estimation trees (PETs). First, we derive a new
probability smoothing that takes into account the class distributions of all the nodes
from the root to each leaf. Secondly, we introduce or adapt some new splitting
criteria aimed at improving probability estimates rather than improving
classification accuracy, and compare them with other accuracy-aimed splitting criteria.
Thirdly, we analyse the effect of pruning methods and we choose a cardinality-based
pruning, which is able to significantly reduce the size of the trees without
degrading the quality of the estimates. The quality of probability estimates of
these three issues is evaluated by the 1-vs-1 multi-class extension of the Area
Under the ROC Curve (AUC) measure, which is becoming widespread for
evaluating probability estimators, ranking of predictions in particular.



1 Introduction 

Decision-tree learning has been extensively used in many application areas of
machine learning, especially for classification, because the algorithms developed for
learning decision trees [3,13] represent a good compromise between comprehensibility,
accuracy and efficiency. In the common setting, a classifier is defined as a
function from a set of m arguments or attributes (which can be either nominal or
numeric) to a single nominal value, known as the class. We denote by C the set of c
classes, usually simply referred to by natural numbers 0, 1, 2, ..., c-1. By E we denote the
set of unlabelled examples. A classifier is a function $f: E \to C$. Traditionally, this
setting was sufficient for most classification problems and applications. However,
more and more applications require some kind of reliability, likelihood or
numeric assessment of the quality of each classification. In other words, we not only
want the model to predict a class value for each example but also to give an
estimate of the reliability of each prediction. Such classifiers are usually called soft
classifiers. Soft classifiers are useful in many scenarios, including combination of
classifiers, cost-sensitive learning and safety-critical applications. The most general
presentation of a soft classifier is a probability estimator, i.e. a model that estimates
the probability $p_i(e)$ of membership of class $i \in C$ for every example $e \in E$.

A trained decision tree can be easily adapted to be a probability estimator by using
the absolute class frequencies of each leaf of the tree. For instance, if a node has the
following absolute frequencies $n_1, n_2, \ldots, n_c$ (obtained from the training dataset) the
estimated probabilities for that node can be derived as $p_i = n_i / \sum n_i$. Every new
example falling into that leaf will have these estimated class probabilities. Such trees are
called Probability Estimation Trees (PETs). However, despite this simple conversion
from a decision tree classifier into a PET, the probability estimates obtained by PETs
are quite poor with respect to other probability estimators [14,2].

Some recent works have changed this situation. First, Provost and Domingos [12] 
improve the quality of PETs by reassessing some classical techniques in decision tree 
learning. In particular, they found that frequency smoothing of the leaf probability 
estimates, such as Laplace correction, significantly enhances the estimates, especially 
if they are used for ranking. On the other hand, pruning (or related techniques such as 
C4.5 collapsing) is shown to be unhelpful for increasing probability estimates. Un- 
pruned trees usually give the best results. Independently, in an earlier paper [7] we 
also improve the quality of PETs by considering Laplace correction for the leaves. In 
addition, we showed that splitting criteria aimed at increasing accuracy (or reducing 
error), such as GainRatio, GINI or DKM [13,3,10] are not necessarily the best criteria 
for estimating good probabilities. Splitting criteria based on probability ranking, such 
as the new AUC-splitting criterion [7] can produce better results when the aim is to 
obtain good probability estimates. These two works are first steps that show that deci- 
sion trees can be successfully used as probability estimators, provided we reassess and 
redefine some of the traditional techniques specifically devised for improving the 
accuracy of decision trees. 

Provost and Domingos [12] “believe that a thorough study of what are the best
methods for PETs would be a successful contribution to machine-learning research”.
In this spirit and as a sequel and natural continuation of the above-mentioned works,
in this paper we present the following enhancements: (i) a new smoothing method
(m-branch smoothing) for estimating probabilities, that not only considers the leaves, but
all the frequencies from the root to the leaf; (ii) a new splitting criterion (MSEEsplit)
defined in terms of the minimum squared error of the probability estimates; and (iii) a
simple pruning criterion based on the cardinalities of nodes that is able to reduce the
size of trees, without degrading the quality of the probability estimates.

The paper is organised as follows. In Section 2 we describe in more detail what a
PET is and how it can be evaluated. Section 3 presents the new smoothing method.
Section 4 introduces two new splitting criteria, MAUCsplit and MSEEsplit. Section 5
analyses the use of pruning and presents the influence of the degree of pruning of the
best pruning method we have found so far for PETs. Finally, Section 6 discusses the
results, and Section 7 closes the paper and proposes future work.



2 PETs, Features and Evaluation 

In this section we present some necessary definitions, the evaluation framework, the 
experimental setting and some previous results in order to set the stage for the rest of 
the paper. The main contributions of this work are presented in subsequent sections. 

Given the set of unlabelled examples E and the set C of c classes, we define a
probability estimator as a set of c functions $p_{i \in C}: E \to \mathbb{R}$ such that
$\forall p_{i \in C}, e \in E: 0 \le p_i(e) \le 1$ and $\forall e \in E: \sum_{i \in C} p_i(e) = 1$.
Decision trees are formed of nodes, splits and conditions. A condition is any Boolean
function $g: E \to \{true, false\}$. A split is a set of s conditions $\{g_k : 1 \le k \le s\}$.
In this paper, we consider the conditions of a split to be
exhaustive and exclusive, i.e., for a given example one and only one of the conditions
of a split is true. A decision tree can be defined recursively as follows: (i) a node with
no associated split is a decision tree, called a leaf; (ii) a node with an associated split
$\{g_k : 1 \le k \le s\}$ and a set of s children $\{t_k\}$, such that each condition is associated with
one and only one child, and each child is a decision tree, is also a decision tree.
Given a tree t there is just a single node r that is not the child of any other node. This
special node is called the root of the tree. The sequence of nodes $\langle v_1, v_2, \ldots, v_d \rangle$ from
the root to a leaf l, where $v_d = l$ and $v_1$ is the root, is called the branch leading to l.

In the most straightforward and classical scenario a decision tree is learned by using
a training set T, which is a set of labelled examples, i.e., a set of pairs of the form
$\langle e, i \rangle$ where $e \in E$ and $i \in C$. After the training stage, the examples will have been
distributed among all the nodes in the tree, where the root node contains all the
examples and downward nodes contain the subset of examples that are consistent with the
conditions of the specific branch. Therefore, every node has particular absolute
frequencies $n_1, \ldots, n_c$ for each class. The cardinality of the node is given by $\sum n_i$. A
decision tree classifier (DTC) is defined as a decision tree with an associated labelling
of the leaves with classes. Usually, the assigned class is the most frequent class in the
leaf ($\mathrm{argmax}_i(n_i)$). A probability estimation tree (PET) is a decision tree where each
leaf is assigned a probability distribution over classes. These probability estimates can
for instance be relative frequencies $p_i = n_i / \sum n_i$.

2.1 DTCs, PETs and Their Evaluation 

One of the first questions that may arise is whether a good DTC is always a good PET 
and vice versa. Although there is a high correlation between quality of DTCs and 
quality of PETs, some recent works have shown that many heuristics used for improv- 
ing classification accuracy “reduce the quality of probability estimates” [12]. Hence, 
it is worth investigating new heuristics and techniques which are specific to PETs and 
that may have been neglected by previous work in DTCs. 

But first of all, a standard measure for evaluating the quality of PETs must be es- 
tablished. As justified and used by [12,7] and other previous work, the AUC (Area 
under the ROC Curve) measure has been chosen for evaluation. The measure can be 
interpreted as the probability that a randomly chosen example e of class 0 will have an 
estimated pfe) greater than the estimated pfe). Consequently, this is a measure par- 
ticularly suitable for evaluating ranked two-class predictions. Recently, an extension 
of the AUC measure for more than two classes has been proposed by Hand and Till
[9]. The idea is to simply average the AUC of each pair of classes (1-vs-1 multi-class).
We call this measure MAUC for multi-class AUC (Hand and Till denote the
function by M). Clearly, MAUC = AUC when c=2.
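
As an illustration of the 1-vs-1 averaging, the following Python sketch computes Hand and Till's M directly from a list of per-example class probability estimates by counting pairs of examples. It is a naive quadratic-time version for small data, not the leaf-based optimisation mentioned in the next paragraph; the function names are hypothetical, classes are assumed to be numbered 0, ..., c-1, every class is assumed to occur at least once, and ties are counted as one half.

```python
from itertools import combinations


def auc_of_pair(scores, labels, i, j):
    """Probability that a random class-i example scores higher than a
    random class-j example (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == i]
    neg = [s for s, y in zip(scores, labels) if y == j]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mauc(probs, labels):
    """Average pairwise AUC over all unordered pairs of classes.

    probs[k][i] is the estimated probability of class i for example k.
    """
    classes = sorted(set(labels))
    c = len(classes)
    total = 0.0
    for i, j in combinations(classes, 2):
        p_i = [row[i] for row in probs]
        p_j = [row[j] for row in probs]
        total += (auc_of_pair(p_i, labels, i, j)
                  + auc_of_pair(p_j, labels, j, i)) / 2
    return 2 * total / (c * (c - 1))


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8]]
    labels = [0, 0, 1, 1]
    print(mauc(probs, labels))
```

For two classes whose probability estimates sum to 1 the two directed terms coincide, so the value reduces to the ordinary AUC.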

In [7] we introduced a new method for efficiently computing MAUC based on the 
ranking of leaves rather than a ranking of examples. Hence, the complexity of the new 
method depends on the number of leaves rather than on the number of examples, 
frequently entailing better performance. In what follows, we use this optimisation. 

2.2 Datasets and Experimental Methodology 

We evaluated the methods presented in this paper on 50 datasets from the UCI
repository [1]. Half of them have two classes, either originally or by selecting one of the
classes and joining all the other classes, and the rest have more than two classes
(multi-class datasets). The datasets are described in Table 1 and Table 2. The columns
show the dataset number and name, the number of classes (Table 2 only), the size
(number of examples), the numbers of nominal and numerical attributes and the
relative size (%) of the minority class.



Table 1. Two-class datasets used.

  #  Dataset           Size   Nom  Num  %Min class
  1  Monks1             566     6    0  50
  2  Monks2             601     6    0  34.28
  3  Monks3             554     6    0  48.01
  4  Tic-tac            958     8    0  34.66
  5  House-votes        435    16    0  38.62
  6  Agaricus          8124    22    0  48.2
  7  Breast-wdbc        569     0   30  37.26
  8  Breast-can-wise    699     0    9  34.48
  9  Breast-wpbc        194     0   33  23.71
 10  Ionosphere         351     0   34  35.9
 11  Liver-Bupa         345     0    6  42.03
 12  Pima-Ab alone      768     0    8  34.9
 13  Chess-kr-vs-kp    3196    36    0  47.78
 14  Sonar              208     0   60  46.63
 15  Hepatitis           83    14    5  18.07
 16  Thyroid-hypo      2012    19    6  6.06
 17  Thyroid-sick-eu   2012    19    6  11.83
 18  Yeast2c           1484     0    8  31.20
 19  Spect              267    22    0  20.60
 20  Habermn-Brst       306     0    3  26.47
 21  Spam              4601     0   57  39.40
 22  Cyl-Bands          365    19   17  36.99
 23  Pima-Diabetes      768     0    8  34.90
 24  Sick              2751    21    6  7.92
 25  Lymph_2c           142    15    3  42.96

Table 2. Multi-class datasets used.

  #  Dataset          #Classes   Size   Nom  Num  %Min class
 26  Hypothyroid 3c       3      2750    21    6  3.24
 27  Balance-Scale        3       625     0    4  7.84
 28  Cars                 4      1728     6    0  3.76
 29  Dermatology          6       366    33    1  5.46
 30  New-Thyroid          3       215     0    5  13.95
 31  Nursery4C            4     12957     8    0  2.53
 32  Page-Blocks          5      5473     0   10  0.51
 33  Pendigits           10     10992     0   16  9.60
 34  Tae                  3       151     2    3  32.45
 35  Iris                 3       150     0    4  33.33
 36  Optdigits           10      5620     0   64  9.86
 37  Segmentation         7      2310     0   19  14.29
 38  Wine                 3       178     0   13  26.97
 39  Heart-Dis-All        5       920     8    5  3.04
 40  Anneal               5       898    32    6  0.89
 41  Hayes-Roth           3       160     4    0  19.38
 42  Waveform             3      5000     0   21  32.94
 43  CMC                  3      1473     7    2  22.61
 44  Ecoli4C              4       336     0    7  7.44
 45  Autos-Drvwhls        3       205     9   16  4.39
 46  Solar Flarec         3       323    10    0  2.17
 47  HorseCoucOutC        3       366    13    8  14.21
 48  Ann-Thyroid          3      7200    15    6  2.31
 49  Splice               3      3190    60    0  24.04
 50  Sat                  6      6435     0   36  9.73


All experiments have been done within the SMILES system
(http://www.dsic.upv.es/~flip/smiles/). The use of the same system for all the methods makes
comparisons more impartial because all other things remain equal. We used the basic
configuration of the system, which is a decision-tree learner quite similar to C4.5, but
without pruning (unless stated), without node “collapsing” [13], and with the GainRatio
splitting criterion used by default (this configuration is sometimes called C4.4).

We performed a 20 times 5-fold cross-validation, thus making a total of 50 x 100 =
5,000 runs of SMILES for each method. We have used 5-fold cross-validation instead
of 10-fold cross-validation because for computing the AUC we need examples of all
the classes and some datasets have a small proportion of examples for the minority
class. In what follows, for each dataset we show the arithmetic mean and the standard
deviation of the 100 runs. Accuracy and AUC are shown as a percentage.



2.3 Results with Laplace and m-Estimate Smoothing

Previously, we have stated that, given any node with absolute frequencies
$n_1, n_2, \ldots, n_c$ for each class (hence overall cardinality $\sum_j n_j$), we can
obtain a probability estimation tree by obtaining the probabilities as
$p_i = n_i / \sum_j n_j$. One problem is that pure nodes with small cardinality will have
the same probability as pure nodes with much higher cardinality. This is especially
problematic for ranking predictions of unpruned trees, because most or all nodes tend to
be pure and there are many ties between the rankings. A common solution to this problem is
the use of probability smoothing such as Laplace correction and m-estimate, defined as
follows:

Laplace smoothing:

$$p_i = \frac{n_i + 1}{\sum_j n_j + c}$$

m-estimate smoothing:

$$p_i = \frac{n_i + m \cdot p}{\sum_j n_j + m}$$

where c is the number of classes. The probability p in the m-estimate is the expected
probability without any additional knowledge, and it is either assumed to be uniform
(p = 1/c) or estimated from the training distribution. In the uniform case, which we
used in our experiments, it is easy to see that Laplace correction is a special case of
the m-estimate with m=c.
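The two corrections can be sketched in a few lines of Python. This is only an
illustration of the formulas for a single node, not the SMILES implementation; the
function names `laplace` and `m_estimate` and the example counts are assumptions
introduced here.

```python
def laplace(counts):
    """Laplace-corrected class probabilities for one node."""
    total, c = sum(counts), len(counts)
    return [(n + 1) / (total + c) for n in counts]


def m_estimate(counts, m, prior=None):
    """m-estimate; the prior is uniform (1/c) unless `prior` is given."""
    total, c = sum(counts), len(counts)
    if prior is None:
        prior = [1.0 / c] * c
    return [(n + m * p) / (total + m) for n, p in zip(counts, prior)]


# Two pure nodes of very different size (hypothetical counts).
small_pure = [3, 0]
large_pure = [300, 0]

# Without smoothing both nodes would predict probability 1 for the first class;
# with smoothing the larger node gets the more confident estimate.
print(laplace(small_pure), laplace(large_pure))

# With a uniform prior and m equal to the number of classes, the m-estimate
# coincides with the Laplace correction.
print(m_estimate(small_pure, m=2))
```

Smoothing breaks the ties between pure leaves of different size, which is what matters
for ranking-based measures such as the AUC.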






Table 3. Effect of smoothing on AUC for two-class datasets (m=4).

     #     Without smoothing     Laplace smoothing  m-estimate smoothing
                 AUC      SD           AUC      SD           AUC      SD
     1          96.8     3.7          97.9     2.7          97.9     2.7
     2          71.2     4.3          70.2     4.8          70.2     4.8
     3          97.7     1.3          99.1     0.9          99.1     0.9
     4          76.2     3.6          87.2     2.8          87.2     2.8
     5          93.6     2.5          98.2     1.4          98.2     1.4
     6         100.0     0.0         100.0     0.0         100.0     0.0
     7          91.9     2.9          96.8     1.8          96.8     1.8
     8          93.3     2.1          97.9     1.1          97.9     1.1
     9          59.6     8.9          64.2     9.5          64.2     9.5
    10          90.8     4.2          94.6     3.7          94.6     3.7
    11          61.1     6.1          67.0     6.8          67.0     6.8
    12          67.2     1.7          79.0     1.4          79.0     1.4
    13          99.5     0.3         100.0     0.1         100.0     0.1
    14          66.0     7.6          73.5     7.2          73.5     7.2
    15          65.5    11.6          75.7     9.1          75.7     9.1
    16          90.5     3.9          97.9     1.0          97.9     1.0
    17          80.7     3.1          85.1     2.6          84.6     2.7
    18          64.8     2.9          74.3     2.8          74.3     2.8
    19          67.4     7.1          74.3     7.1          75.0     7.2
    20          56.3     6.6          64.3     7.5          64.3     7.5
    21          91.7     0.9          96.9     0.5          96.9     0.5
    22          66.6     5.4          68.5     5.1          68.5     5.1
    23          65.9     4.2          75.3     3.8          75.3     3.8
    24          89.3     3.1          98.7     0.8          98.7     0.8
    25          78.7     7.8          87.3     6.9          87.3     6.9
 ARITM          79.3                  85.0                  85.0
  GEOM          78.0                  83.9                  84.0

Table 4. Effect of smoothing on AUC for multi-class datasets (m=4).

     #     Without smoothing     Laplace smoothing  m-estimate smoothing
                 AUC      SD           AUC      SD           AUC      SD
    26          97.9     1.4          99.8     0.3          99.8     0.3
    27          75.0     1.8          82.9     2.7          82.9     2.7
    28          94.7     2.1          95.4     1.6          95.4     1.6
    29          98.4     0.8          99.0     0.6          99.0     0.6
    30          94.3     3.5          97.1     2.7          97.1     2.7
    31          99.4     0.3          99.7     0.1          99.7     0.1
    32          94.4     1.8          97.8     1.0          97.8     1.0
    33          99.3     0.1          99.7     0.0          99.7     0.0
    34          75.0     7.9          74.7     8.5          74.7     8.5
    35          97.3     2.2          98.5     1.8          98.5     1.8
    36          98.2     0.2          99.0     0.1          99.0     0.1
    37          99.3     0.2          99.7     0.1          99.7     0.1
    38          96.6     2.7          97.8     1.9          97.8     1.9
    39          63.7     3.7          65.6     3.6          65.6     3.6
    40          98.6     2.0          99.2     1.1          99.2     1.1
    41          89.8     4.2          90.5     4.3          90.5     4.3
    42          83.4     1.0          88.8     0.9          88.8     0.9
    43          62.0     2.6          65.5     2.7          65.5     2.7
    44          93.0     2.9          95.3     2.5          95.3     2.5
    45          88.1     8.4          93.0     5.8          93.0     5.8
    46          57.5     8.1          58.6    10.4          58.6    10.4
    47          66.3     5.4          70.5     4.9          70.5     4.9
    48          98.1     1.0          99.8     0.2          99.8     0.2
    49          95.6     0.6          98.1     0.4          98.1     0.4
    50          95.1     0.3          96.9     0.3          96.9     0.3
 ARITM          88.4                  90.5                  90.5
  GEOM          87.3                  89.5                  89.5





Tables 3 and 4 show the results (mean and standard deviation over the 20 × 5 = 100 runs)
without smoothing, with Laplace smoothing and with the m-estimate with uniform prior (the
best experimental value for m, m=4, is used). These results are similar to those of [12,7]
and they are shown here to serve as a reference from which we will illustrate our own
improvements. The improvement of Laplace and m-estimate smoothing over no smoothing is
obvious, especially for two-class datasets, and there is no need to perform a significance
test. On the other hand, there is virtually no difference between Laplace smoothing and
the best m-estimate.
3 m-Branch Smoothing

We continue to investigate whether the previous results can be further improved. In
this section we propose a more sophisticated smoothing method called m-branch
smoothing. In the next section we consider alternative splitting criteria that are
designed specifically for probability estimation trees.

First of all, the previous m-estimate and Laplace smoothing methods consider a 
uniform class distribution of the sample. That is, they consider the global population 
uniform whereas in many cases the class probabilities are unbalanced. However, just 
taking this into account does not improve the measures significantly, since each node 
takes a subsample from the upper node, and this, once again, makes a subsample of 
the upper node, until the root is reached. Usually, this means that the sample used to 
obtain the probability estimate in a leaf is the result of many sampling steps, as many 
as the depth of the leaf. It makes sense, then, to consider this history of samples when 
estimating the class probabilities in a leaf. The idea is to assign more weight to nodes 
that are closer to the leaf. 

Definition 1 (m-Branch Smoothing). Given a leaf node l and its associated branch
$\langle v_1, v_2, \ldots, v_d \rangle$ where $v_d = l$ and $v_1$ is the root, denote with
$n_i^j$ the cardinality of class i at node $v_j$. Define $p_i^0 = 1/c$. We recursively
compute the probabilities of the nodes from 1 to d as follows:

$$p_i^j = \frac{n_i^j + m \cdot p_i^{j-1}}{\sum_i n_i^j + m}$$

The m-branch smoothed probabilities of leaf l are given by $p_i^d$.

We note that m-branch smoothing is a recursive root-to-leaf extension of the m- 
probability estimate used by Bratko and Cestnik for decision tree pruning [5]. 

Since this is an iteration of the m-estimate, we could use a fixed value of m. However,
if we use a small m the smoothing would almost be irrelevant for upper nodes, which have
high cardinality. On the other hand, if we use a large m the small cardinalities at the
bottom of the branch would have low relevance. In order to solve this we use a variable
value, which depends on the size of the dataset and the depth. Define the height of a
node as $h = d + 1 - j$, where d is the depth of the branch and j the depth of the node.
The normalised height of a node is defined as $\Delta = 1 - 1/h$ in order to increase the
correction closer to the root. We then parametrise the m value as follows:

$$m = M \cdot (1 + \Delta \cdot \sqrt{N})$$

where M is a constant and N is the global cardinality of the dataset. The use of the
square root of N is inspired by “the square root law”, which connects the error and the
sample size. The previous expression means that m-branch smoothing is performed with a
value of M at the leaves, the next node up is done with
$M + \frac{1}{2} \cdot M \cdot \sqrt{N}$, the next with
$M + \frac{2}{3} \cdot M \cdot \sqrt{N}$, until the root with
$M + \frac{d-1}{d} \cdot M \cdot \sqrt{N}$.
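The recursion of Definition 1, combined with the height-dependent value of m, can be
sketched as follows. This is a minimal illustration rather than the SMILES code: it
assumes the recursion $p_i^j = (n_i^j + m \cdot p_i^{j-1}) / (\sum_i n_i^j + m)$ with a
uniform start $p_i^0 = 1/c$ and $m = M \cdot (1 + \Delta \cdot \sqrt{N})$; the function
name `m_branch` and the example counts are hypothetical.

```python
import math


def m_branch(branch_counts, M, N):
    """Smoothed class probabilities of the leaf at the end of a branch.

    branch_counts: per-class counts for each node, root first, leaf last.
    M: base smoothing constant; N: number of examples in the whole dataset.
    """
    d = len(branch_counts)
    c = len(branch_counts[0])
    p = [1.0 / c] * c                      # p^0: uniform start above the root
    for j, counts in enumerate(branch_counts, start=1):
        h = d + 1 - j                      # height: d at the root, 1 at the leaf
        delta = 1.0 - 1.0 / h              # normalised height: 0 at the leaf
        m = M * (1.0 + delta * math.sqrt(N))
        total = sum(counts)
        # m-estimate whose prior is the smoothed estimate of the parent node
        p = [(n + m * prev) / (total + m) for n, prev in zip(counts, p)]
    return p


# Hypothetical branch: root, one inner node, and a small pure leaf.
branch = [[60, 40], [10, 5], [3, 0]]
print(m_branch(branch, M=4, N=100))
```

For a branch consisting of a single node the result reduces to the ordinary m-estimate
with m = M and a uniform prior, which is a convenient sanity check.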

In Tables 5 and 6 we compare m-branch smoothing (with the best experimental value, M=4)
with the best previous results (m-estimate). We also perform a paired t-test to test the
significance of the results. The ‘Better?’ column indicates whether m-branch smoothing
performs significantly better (✓) or worse (×) than m-estimate smoothing, according to a
t-test with level of confidence 0.1. A tie (-) indicates the difference is not
significant at this level.



Table 5. Comparison of m-estimate and m-branch smoothing on two-class datasets.

       #  m-estimate smoothing    m-branch smoothing   Better?
                 AUC      SD           AUC      SD
       1        97.9     2.7          97.2     3.4         ×
       2        70.2     4.8          67.4     5.0         ×
       3        99.1     0.9          99.1     1.0         -
       4        87.2     2.8          86.9     2.7         -
       5        98.2     1.4          98.5     1.4         -
       6       100.0     0.0         100.0     0.0         -
       7        96.8     1.8          96.9     1.6         -
       8        97.9     1.1          98.0     1.1         -
       9        64.2     9.5          65.9    10.0         -
      10        94.6     3.7          94.4     3.7         -
      11        67.0     6.8          70.0     7.1         ✓
      12        79.0     1.4          82.2     1.4         ✓
      13       100.0     0.1          99.9     0.1         ×
      14        73.5     7.2          75.7     6.3         ✓
      15        75.7     9.1          77.5     9.4         -
      16        97.9     1.0          98.1     1.0         -
      17        84.6     2.7          86.1     2.7         ✓
      18        74.3     2.8          75.7     2.6         ✓
      19        75.0     7.2          77.9     7.0         ✓
      20        64.3     7.5          67.3     7.0         ✓
      21        96.9     0.5          97.0     0.5         -
      22        68.5     5.1          68.5     5.2         -
      23        75.3     3.8          78.8     3.3         ✓
      24        98.7     0.8          98.7     0.8         -
      25        87.3     6.9          87.4     6.9         -
AritMean        85.0                  85.8                 8 wins, 14 ties,
 GeoMean        84.0                  84.9                 3 losses

Table 6. Comparison of m-estimate and m-branch smoothing on multi-class datasets.

       #  m-estimate smoothing    m-branch smoothing   Better?
                 AUC      SD           AUC      SD
      26        99.8     0.3          99.8     0.2         ✓
      27        82.9     2.7          81.3     2.9         ×
      28        95.4     1.6          95.3     1.5         -
      29        99.0     0.6          99.2     0.5         ✓
      30        97.1     2.7          97.4     2.7         -
      31        99.7     0.1          99.7     0.1         ×
      32        97.8     1.0          98.8     0.7         ✓
      33        99.7     0.0          99.8     0.0         ✓
      34        74.7     8.5          75.0     8.7         -
      35        98.5     1.8          98.5     1.8         -
      36        99.0     0.1          99.3     0.1         ✓
      37        99.7     0.1          99.7     0.1         ✓
      38        97.8     1.9          97.8     1.8         -
      39        65.6     3.6          69.1     3.6         ✓
      40        99.2     1.1          98.6     2.2         ×
      41        90.5     4.3          91.7     4.2         ✓
      42        88.8     0.9          95.0     0.5         ✓
      43        65.5     2.7          71.1     2.4         ✓
      44        95.3     2.5          95.6     2.6         -
      45        93.0     5.8          91.7     7.1         -
      46        58.6    10.4          59.3    12.0         -
      47        70.5     4.9          76.2     5.2         ✓
      48        99.8     0.2          99.8     0.3         -
      49        98.1     0.4          98.7     0.3         ✓
      50        96.9     0.3          98.3     0.2         ✓
AritMean        90.5                  91.5                 13 wins, 9 ties,
 GeoMean        89.5                  90.6                 3 losses



The results (21 wins, 23 ties, 6 losses) show that there are many cases where the
difference is not significant (especially when the AUC was close to 100) but there are
many more cases where the results are improved than degraded. In overall geometric
means, m-branch smoothing improves AUC by one percentage point, from 86.7% to 87.7%.



4 Splitting Criteria for PETs 

A crucial factor for the quality of a decision tree learner is its splitting criterion. A 
variety of splitting criteria, including Gini [3], Gain, Gain Ratio and C4.5 criterion 
[13], and DKM [10] have been presented to date. However, all these were designed 
and evaluated for classifiers, not for probability estimators. In this section we propose 
and investigate two splitting criteria specifically designed for PETs. 



4.1 MAUC Splitting Criterion 



In [7] we introduced a novel splitting criterion, which was aimed at maximising the
AUC of the resulting tree rather than its accuracy. It simply computes the quality of
each split as the AUC of the nodes resulting from that split, assuming a two-class
problem. This can be generalised to more than two classes using Hand and Till’s 1-vs-1
average [9].

Definition 2 (MAUCsplit). Given a split s, the quality of the split is defined as:

$$MAUCsplit(s) = MAUC(t_s)$$

where $t_s$ indicates the tree with the node being split as root.

The idea of using the same measure for splitting that is used as well for evaluation
seems straightforward. Nonetheless, in the same way that accuracy (expected error) is
not necessarily the best splitting criterion for accuracy, MAUCsplit may not be the best
splitting criterion for MAUC.
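For the two-class case, the AUC of the nodes produced by a split can be computed
directly from the class counts of the children. The sketch below is one way to do this,
not necessarily the exact procedure of [7], and it omits both the multi-class 1-vs-1
average and the correction for the number of children; the function name `split_auc` is
an assumption introduced here. Children are ranked by their proportion of positives, and
examples within the same child are treated as ties.

```python
def split_auc(children):
    """AUC of a two-class split; `children` is a list of (pos, neg) counts."""
    children = [(p, n) for p, n in children if p + n > 0]  # ignore empty children
    P = sum(p for p, _ in children)
    N = sum(n for _, n in children)
    if P == 0 or N == 0:
        return 0.5  # convention chosen here: the AUC is undefined with one class
    # Rank children from the most to the least positive.
    ranked = sorted(children, key=lambda pn: pn[0] / (pn[0] + pn[1]), reverse=True)
    area, pos_seen = 0.0, 0
    for pos, neg in ranked:
        # Each negative in this child is outranked by all positives seen so far
        # and ties with the positives of its own child (half credit).
        area += neg * (pos_seen + pos / 2.0)
        pos_seen += pos
    return area / (P * N)


print(split_auc([(10, 0), (0, 10)]))   # children perfectly separate the classes
print(split_auc([(8, 2), (2, 8)]))     # informative but imperfect split
print(split_auc([(5, 5), (5, 5)]))     # uninformative split
```

A split that separates the classes perfectly scores 1, and a split whose children all
have the same class proportions scores 0.5.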



4.2 MSEE Splitting Criterion 

A different approach is to consider that the tree really predicts probabilities. It thus
makes sense to minimise the quadratic error committed when guessing these probabilities.
Consider a split where each of the children has estimated probabilities $p_i$ for each
class. Assume that nodes assign classes according to $p_i$. Consequently, $p_i$ means the
probability of examples of class i falling into the node but also means the probability
of being classified as i. Assuming these two interpretations of $p_i$ are independent,
the probability that an example of class i is misclassified, denoted by $p_{e,i}$, can be
estimated as follows:

$$p_{e,i} = p_i \cdot \sum_{j \neq i} p_j = p_i \cdot (1 - p_i)$$

In words, this combines the probability that an example is of class i and the
probability that it is not classified accordingly (the sum of the rest of probabilities,
which is $1 - p_i$). This is similar to the Gini index. However, we want to measure the
quadratic error of the prediction, which in our case is not a class but a probability.
Hence, given a misclassification:

- $p_i$ should have been 1 but is $p_i$. Thus, the error can be estimated as $(1 - p_i)^2$.

- $p_j$, $j \neq i$, should have been 0 and is $p_j$. The error can be estimated as
  $(0 - p_j)^2$.

Consequently, we have a total quadratic error of:

$$Error_i = p_{e,i} \cdot \Big( (1 - p_i)^2 + \sum_{j \neq i} p_j^2 \Big)$$

Therefore, if we consider a split of n nodes, then we can compute the quality of the
split as the negative value of the total error for all the nodes:

Definition 3 (MSEEsplit). Given a split s, the quality of the split is defined as:

$$MSEEsplit(s) = \sum_{k=1..n} q_k \cdot \Big( -\sum_{i=1..c} Error_i \Big)$$

where $q_k$ indicates the relative cardinality of the k-th child in the split and
$Error_i$ is computed from the class probabilities of that child.
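The criterion can be illustrated with a short sketch. It assumes that the per-class
error of a node is $p_i (1 - p_i)$ times the squared error $(1 - p_i)^2 + \sum_{j \neq i}
p_j^2$, and that the split quality is the negative of these errors weighted by the
relative size of each child. It uses unsmoothed relative frequencies and leaves out the
correction for the number of children; both simplifications, and the function names, are
assumptions made here for illustration.

```python
def node_error(probs):
    """Sum over all classes of Error_i for one node's class probabilities."""
    total = 0.0
    for i, p_i in enumerate(probs):
        p_e = p_i * (1.0 - p_i)  # chance that a class-i example is misclassified
        others = sum(p_j ** 2 for j, p_j in enumerate(probs) if j != i)
        total += p_e * ((1.0 - p_i) ** 2 + others)
    return total


def msee_split(children):
    """Negative size-weighted error of a split.

    `children` holds one list of per-class counts for each child node.
    """
    n_total = sum(sum(counts) for counts in children)
    quality = 0.0
    for counts in children:
        size = sum(counts)
        if size == 0:
            continue  # empty children contribute nothing
        probs = [n / size for n in counts]  # unsmoothed relative frequencies
        quality -= (size / n_total) * node_error(probs)
    return quality


# Higher (closer to zero) is better: the split into pure children wins.
print(msee_split([[10, 0], [0, 10]]))
print(msee_split([[8, 2], [2, 8]]))
print(msee_split([[5, 5], [5, 5]]))
```

Because every term of the error is non-negative, the quality is at most zero, and it
reaches zero only when all children are pure.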

The way in which the error is obtained gives the name for the criterion: Minimum
Squared Expected Error (MSEE). Note that this expression is similar to the Brier
score [4], which has also been used recently as a measure for predictive models in
similar applications as where AUC is used.

Both MAUCsplit and MSEEsplit are modified in order to penalise splits with a
high number of children, in a similar way as GainRatio is a modification of the Gain
criterion. The precise correction we have used in the experiments can be found in [8].



4.3 Splitting Criteria Comparison 

We have compared several splitting criteria: GainRatio (as implemented in C4.5, i.e.,
considering only the splits with Gain greater than the mean [13]), MGINI (as implemented
in CART [3]), DKM (as presented in [10]), MAUCsplit with children correction and
MSEEsplit with children correction. For each criterion we show the results with the
split smoothing that works best for it. This smoothing should not be confused with the
smoothing used for computing the AUC for evaluating the PETs, which will always be
m-branch smoothing. Table 7 summarises the results (the complete results can be found
in [8]).



Table 7. Summary of Accuracy and AUC for several splitting criteria (geometric means).

                     C4.5split   Gain  MGINI    DKM  MAUCsplit  MSEEsplit  Better? MSEE vs C4.5
2-class   Accuracy        81.4   81.6   81.4   81.7       81.8       82.0  11 wins, 9 ties, 5 losses
          AUC             84.9   84.8   84.6   84.8       85.0       85.3  7 wins, 13 ties, 5 losses
>2-class  Accuracy        82.8   83.0   83.1   83.1       82.4       83.0  10 wins, 11 ties, 4 losses
          AUC             90.6   90.9   90.8   91.1       90.8       90.9  7 wins, 13 ties, 5 losses
All       Accuracy        82.1   82.3   82.2   82.4       82.1       82.5  21 wins, 20 ties, 9 losses
          AUC             87.7   87.8   87.7   87.9       87.8       88.1  14 wins, 26 ties, 10 losses



According to these and previous results, the best DTC splitting criterion is DKM, but
its difference from the rest of the DTC criteria (MGINI, C4.5) is not significant. The
new criterion MAUCsplit is slightly better than C4.5 and MGINI, although differences are
small and not significant. Finally, MSEEsplit appears to be the best, although
differences are small with respect to C4.5 and even smaller with respect to DKM. The
good behaviour of both MSEE and DKM may be explained by the fact that both methods use
quadratic terms.



5 Pruning and PETs 

As we have discussed in the introduction, in [12] it is argued that pruning is
counterproductive for obtaining good PETs and, consequently, pruning (and related
techniques) should be disabled. However, it is not clear whether the reason is that
pruning is intrinsically detrimental for probability estimation, or that existing
pruning methods are devised for accuracy and not for increasing AUC.

Independently, we have evaluated some classical pre-pruning and post-pruning
methods, such as Expected Error Pruning and Pessimistic Error Pruning (see e.g. [6]
for a comparison). Our results match those of [12]; even slight pruning degrades the
quality (measured in terms of AUC) of the probability estimates. It seems that
smoothing has a relevant effect here: if we disable smoothing, pruning is beneficial in
some cases. Consequently, it looks as though the better the smoothing at the leaves is,
the worse pruning will be. It appears that this will be especially true for our m-branch
smoothing, since it takes into account the probabilities of all the nodes in the branch.
Pruning will reduce the available information for estimating the probabilities. As a
result, we do not expect to obtain new pruning methods that will increase the AUC of a
PET, but we might be interested in designing pruning methods that reduce the size of the
tree without degrading too much the quality of the PET.

One of the most important issues for estimating good probabilities is the size of the
sample. Consequently, the poorest estimates of a PET will be obtained by the smallest
nodes. If we have to decide to prune some nodes it makes sense to prune the smallest
ones first. This would suggest a very simple pre-pruning method: nodes will not be
expanded when their cardinality is lower than a certain constant. However, datasets
with a large number of classes can have poor probability estimates with medium-large
nodes if there are many small classes. Hence, we can refine cardinality-based pruning,
by using the following definition:

Definition 4 (CardPerClass Pruning). Given a node l, it will be pruned when:

Card{l) < 2— 
c 

where Card(l) is the cardinality of node l, K is a constant (K=0 means no pruning)
and c is the number of classes.



In the following graph, we show the effect of CardPerClass pruning (with K-values
ranging from 16 to 0). The results are shown for MSEEsplit with m-branch smoothing.




Fig. 1. Accuracy, AUC and number of rules for several pruning degrees (geometric mean).

As can be seen in Figure 1, only strong pruning is counterproductive for accuracy
(and even behaves worse than other pruning methods). It is more interesting to observe
the evolution of the AUC curve. The graph suggests that the quality of a PET is
not significantly decreased until K=4, which, on the other hand, leads to a considerable
decrease in the complexity of the trees.



6 Summary 

In previous sections we have presented several enhancements in order to improve the
AUC of PETs. In order to see the whole picture, we show below the accumulated
progress of the techniques presented before.

Although there is a considerable improvement obtainable by using a simple
smoothing such as Laplace smoothing (as shown previously [12,7]), there was still
room for further improvement, as can be seen in Table 8. According to the nature and
number of the datasets, and the quantity and quality of work developed for improving
decision trees, we think that this is a significant result.

Table 8. Summary Table of AUC (only AUC and geomeans shown).

           C4.5split        C4.5split        C4.5split        MSEEsplit        C4.5Lap vs MSEEsplit
           without smooth   with Laplace     with m-branch    with m-branch    m-branch + K=1 pruning
                            smooth           smooth           smooth
           Acc     AUC      Acc     AUC      Acc     AUC      Acc     AUC      Better in AUC?
2-class    81.4    78.0     81.4    83.9     81.4    84.9     82.1    85.4     11 wins, 9 ties, 5 losses
>2-class   82.8    87.3     82.8    89.5     82.8    90.6     83.1    90.9     16 wins, 4 ties, 5 losses
All        82.1    82.5     82.1    86.7     82.1    87.7     82.6    88.1     27 wins, 13 ties, 10 losses



7 Conclusions and Future Work 

In this work we have reassessed the construction of PETs, evaluating and introducing
new methods for the three issues that are most important in PET construction: leaf
smoothing, splitting criteria and pruning. We have introduced a new m-branch
smoothing method that takes the whole branch of decisions into account, as well as a
new MSEE splitting criterion aimed at reducing the squared error of the probability
estimate.

Our new m-branch smoothing is significantly better than previous classical
smoothings (Laplace or m-estimate). With respect to the splitting criteria, there are
few works that compare existing splitting criteria for accuracy. Moreover, to our
knowledge, this is the first work that compares the ranking of probability estimates of
several splitting criteria for PETs. At this point, the conclusion is that all the good
criteria presented so far are also good criteria for AUC and the differences between
them are negligible. Nonetheless, pursuing new measures, we have found new splitting
criteria such as MAUCsplit and MSEEsplit that are comparable to the best known criterion
(or even better, although this is not conclusive). Finally, we have shown that a simple
cardinality pruning method can be applied (to a certain extent) to obtain simpler PETs
without degrading their quality too much. Consequently, the idea that pruning is
intrinsically bad for PETs is still in question, or, at least, we reiterate that a
statement of its negative influence is “inconclusive” [12]. A very recent work has also
suggested that a mild pruning could be beneficial [11].

As future work, other methods for improving the estimates (without modifying the 
structure of a single tree) such as the method presented in [11] (which uses the fre- 
quencies of all the leaves on the trees) could yield a method that takes into account all 
the information in the tree. Additionally, we think that better pruning methods for 
PETs could still be developed (considering the size of the dataset as an additional 
factor) — these might include the use of the m-branch estimate for pruning (as similar 
measures were originally introduced [5]). 
Abstract. The EM algorithm is a popular method for maximum likelihood
estimation of Bayesian networks in the presence of missing data.
Its simplicity and general convergence properties make it very attractive.
However, it sometimes converges slowly. Several accelerated EM methods
based on gradient-based optimization techniques have been proposed. In
principle, they all employ a line search involving several NP-hard
likelihood evaluations. We propose a novel acceleration called SCGEM based
on scaled conjugate gradients (SCGs) well-known from learning neural
networks. SCGEM avoids the line search by adopting the scaling mechanism
of SCGs applied to the expected information matrix. This guarantees
a single likelihood evaluation per iteration. We empirically compare
SCGEM with EM and conventional conjugate gradient accelerated EM.
The experiments show that SCGEM can significantly accelerate both of
them and is equal in quality.



1 Introduction 

Bayesian networks [19] are one of the most important frameworks for represent- 
ing and reasoning with probabilities. They specify joint probability distributions 
over finite sets of random variables, and have been applied to many real-world 
problems in diagnosis, forecasting, sensor fusion etc. Over the past years, there 
has been much interest in the problem of learning Bayesian networks from data. 
For learning Bayesian networks, parameter estimation is a fundamental task not 
only because of the inability of humans to reliably estimate the parameters, but 
also because it forms the basis for the overall learning problem [6] . 

It is often desired to find the parameters maximizing the likelihood (ML). 
The likelihood is the probability of the observed data as a function of the un- 
known parameters with respect to the current model. Unfortunately in many 
real-world domains, the data cases available are incomplete, i.e., some values 
may not be observed. For instance in medical domains, a patient rarely gets all 
of the possible tests. In presence of missing data, the maximum likelihood esti- 
mate typically cannot be written in closed form. It is a numerical optimization 
problem, and all known algorithms involve nonlinear, iterative optimization and 
multiple calls to a Bayesian network inference as subroutines. The latter ones 
have been proven to be NP-hard [4]. The most common technique for ML pa- 
rameter estimation of Bayesian networks in the presence of missing data is the 
Expectation-Maximization (EM) algorithm. 
Despite the success of the EM algorithm in practice due to its simplicity
and fast initial progress, it has been argued (see e.g. [8,13] and references
therein) that the EM convergence can be extremely slow, and that more advanced
second-order methods should in general be favored over EM. In the context of
Bayesian networks, Thiesson [21], Bauer et al. [1], and Ortiz and Kaelbling [17]
investigated acceleration of the EM algorithm. All approaches rely on conventional
(gradient-based) optimization techniques, viewing the change in values
in the parameters at an EM iteration as a generalized gradient (see [18] for a
nice overview). Gradient ascent yields parameterized EM, and conjugate gradient
yields conjugate gradient EM (CGEM). Although the accelerated EMs can
significantly speed up the EM, they all require more computational effort per
iteration than the basic EM. One reason is that they perform in each iteration a
line search to choose an optimal step size. There are drawbacks of doing a line
search. First, a line search introduces new problem-dependent parameters such as
a stopping criterion. Second, the line search involves several likelihood
evaluations, which are NP-hard for Bayesian networks. Thus, the line search
dominates the computational costs, resulting in a disadvantage of the accelerated
EMs compared to the EM, which does one likelihood evaluation per iteration. The
extra computational costs have to be amortized over the long run to gain a speed-up.

The contribution of the present paper is a novel acceleration of EM called
scaled CGEM (SCGEM) which overcomes the expensive line search. It evaluates
the likelihood as often as the EM per iteration, namely once. This also explains
the title of the paper “A Fast Accelerated EM”. SCGEM adopts the ideas
underlying scaled conjugate gradients (SCGs), which are well-known from the field
of learning neural networks [15]. SCGs employ an approximation of the Hessian
of the scoring function to quadratically extrapolate the minimum instead
of doing a line search. Then, a Levenberg-Marquardt approach [12] scales the
step size. SCGEM adopts the scaling mechanism for maximization and applies
it to the expected information matrix. This type of accelerated CGEM is novel.
Other work for learning Bayesian networks investigated only approximated line
searches [21,17,1], and thus evaluates the likelihood at least twice per iteration.
From the experimental results, we will argue that SCGEM can significantly accelerate
both EM and CGEM, and is equal in quality.

We proceed as follows. After briefly introducing Bayesian networks in Section 2,
we review maximum likelihood estimation via the EM and gradient ascent
in Section 3. In Section 4, we motivate why accelerating EM is important and
review basic acceleration techniques. Afterwards we introduce SCGEM in Section 5.
In Section 6, we experimentally compare the SCGEM algorithm with the
EM and CGEM algorithms. Before concluding, we discuss related work.

2 Bayesian Networks 

Throughout the paper, we will use $X$ to denote a random variable, $x$ to denote
a state and $\mathbf{X}$ (resp. $\mathbf{x}$) to denote a vector of variables (resp.
states). We will use $\mathbf{P}$ to denote a probability distribution, e.g.
$\mathbf{P}(X)$, and $P$ to denote a probability value, e.g. $P(x)$.

A Bayesian network [19] represents the joint probability distribution
$\mathbf{P}(\mathbf{X})$ over a set $\mathbf{X} = \{X_1, \ldots, X_n\}$ of random
variables. In this paper, we restrict each $X_i$ to have a finite set
$x_i^1, \ldots, x_i^{k_i}$ of possible states. A Bayesian network is
an augmented, acyclic graph, where each node corresponds to a variable $X_i$ and
each edge indicates a direct influence among the variables. We denote the parents
of $X_i$ in a graph-theoretical sense by $\mathbf{Pa}_i$. The family of $X_i$ is
$\mathbf{Fa}_i := \{X_i\} \cup \mathbf{Pa}_i$.
With each node $X_i$, a conditional probability table is associated specifying the
distribution $\mathbf{P}(X_i \mid \mathbf{Pa}_i)$. The table entries are
$\theta_{ijk} = P(X_i = x_i^j \mid \mathbf{Pa}_i = \mathbf{pa}_i^k)$,
where $\mathbf{pa}_i^k$ denotes the kth joint state of the $X_i$’s parents. The network
stipulates the assumption that each node $X_i$ in the graph is conditionally independent
of any subset $\mathbf{A}$ of nodes that are not descendants of $X_i$ given a joint state
of its parents. Thus, the joint distribution over $\mathbf{X}$ factors to
$\mathbf{P}(X_1, \ldots, X_n) = \prod_{i=1}^{n} \mathbf{P}(X_i \mid \mathbf{Pa}_i)$.
In the rest of the paper, we will represent a Bayesian network
with given structure by the M-dimensional vector $\theta$ consisting of all
$\theta_{ijk}$'s.
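The factorization can be made concrete with a toy example. The network below (Rain ->
WetGrass), its variable names and its table entries are hypothetical and serve only to
show how the joint probability of a full assignment is the product of one conditional
probability table entry per node.

```python
import itertools

# Hypothetical two-node network: Rain -> WetGrass, both binary.
parents = {"Rain": [], "WetGrass": ["Rain"]}

# cpt[variable][joint state of the parents][state of the variable]
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.1, False: 0.9},
    },
}


def joint_probability(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | Pa_i) over all nodes."""
    prob = 1.0
    for var, pa in parents.items():
        pa_state = tuple(assignment[p] for p in pa)
        prob *= cpt[var][pa_state][assignment[var]]
    return prob


print(joint_probability({"Rain": True, "WetGrass": True}))

# The joint probabilities of all full assignments sum to one.
total = sum(
    joint_probability({"Rain": r, "WetGrass": w})
    for r, w in itertools.product([True, False], repeat=2)
)
print(total)
```

Each inner dictionary of `cpt` plays the role of one column of table entries
$\theta_{ijk}$ for a fixed parent state k.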

3 Basic ML Parameter Estimation 

Our task is to learn the numerical parameters $\theta_{ijk}$ for a Bayesian network of
a given structure. More formally, we have some initial model $\theta$. We also have
some set of data cases $D = \{d_1, \ldots, d_N\}$. Each data case $d_l$ is a (possibly)
partial assignment of values to variables in the network. We assume that the
data cases are independently sampled from identical distributions (iid). We seek
those parameters $\theta^*$ which maximize the likelihood
$L(D, \theta) := P(D \mid \theta)$ of the data. Due to the monotonicity of the
logarithm, we can also seek the parameters maximizing the log-likelihood
$LL(D, \theta) := \log P(D \mid \theta)$. This simplifies because of the iid assumption
to $\theta^* = \arg\max_{\theta \in \mathcal{H}} \sum_{l=1}^{N} \log P(d_l \mid \theta)$.
Thus, the search space $\mathcal{H}$ to be explored is spanned by the space over the
possible values of $\theta$. In case of complete data D, i.e., the values of all random
variables are observed, Lauritzen [11] showed that maximum likelihood estimation simply
corresponds to frequency counting. However, in the presence of missing data the maximum
likelihood estimates typically cannot be written in closed form, and iterative
optimization schemes like the EM or gradient-based algorithms are needed. We
will now briefly review both approaches because SCGEM heavily builds on them.

The EM algorithm [5] is a classical approach to maximum likelihood esti- 
mation in the presence of missing values. The basic observation underlying the 
Expectation-Maximization algorithm is: learning would be easy if we knew the 
values for all the random variables. Therefore, it iteratively performs two steps 
to find the maximum likelihood parameters of a model: (E-Step) Based on the
current parameters $\theta$ and the observed data $D$, the algorithm computes a
distribution over all possible completions of each partially observed data case.
(M-Step) Each completion is then treated as a fully-observed data case weighted by
its probability. New parameters are then computed based on frequency counts.

More formally, the E-step consists of computing the expectation of the
likelihood given the old parameters $\theta^n$ and the observed data $D$, i.e.,



136 Jorg Fischer and Kristian Kersting 



$$Q(\theta \mid \theta^n, D) = E[\log P(Z \mid \theta) \mid \theta^n, D] . \tag{1}$$

Here, $Z$ is a random vector denoting the completion of the data cases $D$. The
current parameters $\theta^n$ and the observed data $D$ give us the conditional
distribution governing the unobserved states. The expression $E[\cdot \mid \theta^n, D]$ denotes the
expectation over this conditional distribution. $Q$ is sometimes called the
expected information matrix. In the M-step, $Q(\theta \mid \theta^n, D)$ is maximized w.r.t. $\theta$,
i.e., $\theta^{n+1} = \arg\max_{\theta} Q(\theta \mid \theta^n, D)$. Lauritzen [11] showed that this leads to
$\theta^*_{ijk} = ec(\mathrm{fa}_i^{jk} \mid D) / ec(\mathrm{pa}_i^k \mid D)$ where $\mathrm{fa}_i^{jk}$ is the joint state consisting of the
$j$th state of variable $X_i$ and the $k$th joint state $\mathrm{pa}_i^k$ of $\mathrm{Pa}_i$. The term $ec(a \mid D)$
denotes the expected counts of the joint state $a$ given the data. They are
computed by $ec(a \mid D) = \sum_{l=1}^{N} P(a \mid d_l, \theta^n)$ where any Bayesian network inference
engine can be used to compute $P(a \mid d_l, \theta^n)$.
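The E- and M-steps above can be illustrated with a minimal sketch. It assumes a hypothetical two-node network A -> B with binary variables in which A is sometimes unobserved; the function names (`em_step`, `log_likelihood`) and the toy data are illustrative and not taken from the paper, and exact enumeration stands in for a general inference engine.

```python
import math

def _joint(a, b, theta_a, theta_b):
    """P(A=a, B=b) for the toy network A -> B."""
    p_a = theta_a if a == 1 else 1.0 - theta_a
    p_b = theta_b[a] if b == 1 else 1.0 - theta_b[a]
    return p_a * p_b

def em_step(theta_a, theta_b, data):
    """One EM iteration. theta_a = P(A=1), theta_b[a] = P(B=1 | A=a).

    data is a list of (a, b) pairs; a may be None (missing), b is observed.
    Assumes every state of A receives a positive expected count.
    """
    ec_a = [0.0, 0.0]                  # expected counts of A=a
    ec_ab = [[0.0, 0.0], [0.0, 0.0]]   # expected counts of (A=a, B=b)
    for a_obs, b in data:
        if a_obs is None:
            # E-step: posterior over the completions of this data case
            joint = [_joint(a, b, theta_a, theta_b) for a in (0, 1)]
            z = joint[0] + joint[1]
            post = [joint[0] / z, joint[1] / z]
        else:
            post = [1.0 - a_obs, float(a_obs)]
        for a in (0, 1):
            ec_a[a] += post[a]
            ec_ab[a][b] += post[a]
    # M-step: ratios of expected counts (family count / parent count)
    new_theta_a = ec_a[1] / (ec_a[0] + ec_a[1])
    new_theta_b = [ec_ab[a][1] / ec_a[a] for a in (0, 1)]
    return new_theta_a, new_theta_b

def log_likelihood(theta_a, theta_b, data):
    """Log-likelihood of the observed part of the data."""
    total = 0.0
    for a_obs, b in data:
        if a_obs is None:
            total += math.log(sum(_joint(a, b, theta_a, theta_b) for a in (0, 1)))
        else:
            total += math.log(_joint(a_obs, b, theta_a, theta_b))
    return total
```

Each call to `em_step` performs one sweep over the data, which corresponds to the single likelihood evaluation per iteration discussed in the text.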

Gradient ascent, also known as hill climbing, is a classical method for finding
a maximum of a (scoring) function. It iteratively performs two steps. First, it
computes the gradient vector $\nabla_\theta$ of partial derivatives of the log-likelihood with
respect to the parameters of a Bayesian network at a given point $\theta \in \mathcal{H}$. Then,
it takes a small step in the direction of the gradient to the point $\theta + \delta \nabla_\theta$ where
$\delta$ is the step-size parameter. The algorithm will converge to a local maximum
for small enough $\delta$, cf. [12]. Thus, we have to compute the partial derivatives of
$LL(D,\theta)$ with respect to $\theta_{ijk}$. According to Binder et al. [3], they are given by

$$\frac{\partial LL(D,\theta)}{\partial \theta_{ijk}} = \sum_{l=1}^{N} \frac{P(\mathrm{fa}_i^{jk} \mid d_l, \theta)}{\theta_{ijk}} = \frac{ec(\mathrm{fa}_i^{jk} \mid D)}{\theta_{ijk}} .$$

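In contrast to EM, the described gradient-ascent method has to be modified to
take into account the constraint that the parameter vector $\theta$ consists of
probability values, i.e., $\theta_{ijk} \in [0,1]$ and $\sum_j \theta_{ijk} = 1$. A general solution is to
reparameterize the problem so that the new parameters automatically respect the
constraints on $\theta_{ijk}$ no matter what their values are. More precisely, we define the
parameters $\beta_{ijk} \in \mathbb{R}$ such that $\theta_{ijk} = \exp(\beta_{ijk}) / \sum_l \exp(\beta_{ilk})$ where the $\beta_{ijk}$
are indexed like $\theta_{ijk}$. It can be shown using the chain rule of derivatives that

$$\begin{aligned}
\frac{\partial LL(D,\theta)}{\partial \beta_{ijk}}
&= \sum_{i'j'k'} \frac{\partial LL(D,\theta)}{\partial \theta_{i'j'k'}} \cdot \frac{\partial \theta_{i'j'k'}}{\partial \beta_{ijk}}
 = \sum_{i'j'k'} \frac{ec(\mathrm{fa}_{i'}^{j'k'} \mid D)}{\theta_{i'j'k'}} \cdot \frac{\partial \theta_{i'j'k'}}{\partial \beta_{ijk}} \\
&= ec(\mathrm{fa}_i^{jk} \mid D) - \theta_{ijk} \sum_{l} ec(\mathrm{fa}_i^{lk} \mid D) .
\end{aligned}$$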
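The reparameterized gradient can be checked numerically. The following sketch is an illustration under simplifying assumptions: it looks at a single parent configuration (fixed $i$ and $k$) and at fully observed data, so that the expected counts reduce to fixed counts and the log-likelihood contribution is $\sum_j c_j \log \theta_j$. The function names are hypothetical.

```python
import math

def softmax(beta):
    """theta_j = exp(beta_j) / sum_l exp(beta_l), computed stably."""
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    z = sum(exps)
    return [e / z for e in exps]

def log_lik(beta, counts):
    """Log-likelihood contribution of one CPT column with fixed counts."""
    theta = softmax(beta)
    return sum(c * math.log(t) for c, t in zip(counts, theta))

def grad_beta(beta, counts):
    """Closed form from the text: c_j - theta_j * sum_l c_l."""
    theta = softmax(beta)
    total = sum(counts)
    return [c - t * total for c, t in zip(counts, theta)]

def numeric_grad(beta, counts, eps=1e-6):
    """Central finite differences, used only to check grad_beta."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        up[j] += eps
        down = list(beta)
        down[j] -= eps
        grad.append((log_lik(up, counts) - log_lik(down, counts)) / (2 * eps))
    return grad
```

Because the softmax keeps the column normalized for any real-valued $\beta$, the gradient components of one column sum to zero, which is a quick sanity check on an implementation.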

An important view of the gradient for our purposes is the following. It highlights 
the close connection between gradient ascent and the EM: 



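$$\left.\frac{\partial LL(D,\theta)}{\partial \beta_{ijk}}\right|_{\theta = \theta^n} = \left.\frac{\partial Q(\theta \mid \theta^n, D)}{\partial \beta_{ijk}}\right|_{\theta = \theta^n} . \tag{3}$$

Thus, the gradient of the likelihood coincides with the gradient of the expected
information matrix^1.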

4 Accelerated ML Parameter Estimation 

It has been argued that the EM convergence can be extremely slow, see e.g. [13,8]
and references therein. Assume that $(\theta^n)$ is a sequence of parameters computed
by the EM algorithm. Furthermore, assume that $(\theta^n)$ converges to some $\theta^*$.
Then, in the neighbourhood of $\theta^*$, the EM algorithm is essentially a linear
iteration. However, as shown by Dempster et al. [5], the greater the proportion of
missing information, the slower the rate of convergence (in the neighbourhood of
$\theta^*$). If the ratio approaches unity, EM will exhibit slow convergence. Therefore,
more advanced second-order methods should in general be preferred to EM.

In the context of Bayesian networks, Kersting and Landwehr [9] empirically
compared second-order gradient techniques with the EM. They were able to show
that the former can be competitive with EM, but EM remains the
domain-independent method of choice for ML estimation because of its simplicity and
fast initial progress. Within the Bayesian network learning community several
accelerated EM algorithms have been proposed and empirically compared with
the EM algorithm [21,17,1]. We will now briefly review the basic acceleration
techniques following [17], which are also the basic ingredients of SCGEM.

The most straightforward way to accelerate the EM is the parameterized
EM (PEM). It is a gradient ascent where instead of following the gradient of
the log-likelihood we follow the generalized^2 gradient $g_n := \theta_{EM}^{n+1} - \theta^n$, where
$\theta_{EM}^{n+1}$ denotes the EM update with respect to the parameters $\theta^n$ after the $(n+1)$-th
iteration. The simple PEM uses a fixed step size $\delta$ when following the gradient,
i.e., in every iteration the parameters are chosen according to $\theta^{n+1} = \theta^n + \delta \cdot g_n$.

It is not a priori clear how to choose $\delta$. Instead, it would be better to perform a
series of line searches to choose $\delta$ in each iteration, i.e., to do a one dimensional
iterative search for $\delta$ in the direction of the generalized gradient maximizing
$LL(D, \theta^n + \delta \cdot g_n)$. One of the problems with the resulting algorithm is that a
maximization in one direction could spoil past maximizations. This problem is
solved by conjugate gradient EM.

Conjugate gradient EM (CGEM) computes so-called conjugate directions
$h_0, h_1, \ldots$, which are orthogonal, and estimates the step size along these
directions with line searches [7]. Following the scheme of PEM, it iteratively performs
two steps starting with $\theta^0 \in \mathcal{H}$ and $h_0 = g_0$.

1. (Conjugate directions) Set the next direction $h_{n+1}$ according to
$h_{n+1} = g_{n+1} - \gamma_n \cdot h_n$ where

$$\gamma_n = \frac{\left(\nabla LL(D,\theta^{n+1})^T - \nabla LL(D,\theta^{n})^T\right) \cdot g_{n+1}}{\left(\nabla LL(D,\theta^{n+1})^T - \nabla LL(D,\theta^{n})^T\right) \cdot h_n} . \tag{4}$$

2. (Line search) Compute $\theta^{n+1}$ by maximizing $LL(D, \theta^n + \delta \cdot h_{n+1})$ in the
direction of $h_{n+1}$.

^1 This means that gradient ascent is a so-called generalized EM.

^2 Generalized gradients perform regular gradient techniques in a transformed space.

Usually, some initial EM steps are taken before switching to the CGEM. We 
refer to [7,13] for more details on CGEM. 

There are still drawbacks of doing a line search. First, it introduces new
problem-dependent parameters such as a stopping criterion. Second, the line
search involves several likelihood evaluations, i.e., network inferences, which are
known to be NP-hard [4]. Thus, the line search dominates the computational
costs of CGEM, resulting in a serious disadvantage compared to the EM, which
does one likelihood evaluation per iteration. Therefore, it is not surprising that
researchers in the Bayesian network learning community used inexact line search
to reduce the complexity [21,17,1] when accelerating EM. Nevertheless, these
approaches require at least one additional likelihood evaluation per iteration compared to
the EM. In the next section, we will show how to avoid the line search. We
will adopt a variant of conjugate gradients called scaled conjugate gradients to
accelerate EM. They are due to Møller [15] and led to a significant speed-up of
learning neural networks while preserving accuracy.



5 Scaled Conjugate Gradient EM 



Scaled conjugate gradient (SCG) substitutes the line search by employing an 
approximation of the Hessian of the error function to quadratically extrapolate 
the minimum using a Levenberg-Marquardt approach [12] to scale the step size. 
We will now apply this idea to CGEM in the context of maximum likelihood 
parameter estimation of Bayesian networks. Due to space restrictions, we will 
discuss the basic idea only. For more details about SCG, we refer to [15]. 

We quadratically approximate $LL(D,\theta^n)$ at $x$, i.e., $LL(D,\theta^n) + \nabla LL(D,\theta^n)^T \cdot x + \tfrac{1}{2} \cdot x^T \cdot \nabla^2 LL(D,\theta^n) \cdot x$. Then, the step size for CGEM is
$\delta_n = -(h_n^T \cdot \nabla LL(D,\theta^n)) / (h_n^T \cdot \nabla^2 LL(D,\theta^n) \cdot h_n)$. However, one has to compute and store the Hessian
matrix $\nabla^2 LL(D,\theta^n)$. To avoid this, Møller estimated $\nabla^2 LL(D,\theta^n) \cdot h_n$ with a
Newton quotient. The approximation needs to be negative definite in order to
obtain a maximum. Therefore, a scalar $\lambda_n$ is introduced to assure the definiteness,
i.e.,

$$s_n = \frac{\nabla LL(D, \theta^n + \sigma_n \cdot h_n) - \nabla LL(D, \theta^n)}{\sigma_n} + \lambda_n \cdot h_n , \qquad 0 < \sigma_n \ll 1 . \tag{5}$$



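The sign of $\delta_n := h_n^T \cdot s_n$ reveals if the adjusted Hessian approximation is negative
definite. If $\delta_n > 0$ then the approximation is not negative definite, and $\lambda_n$ is
raised and $s_n$ is estimated again. The new approximation $s'_n$ can be derived
from the old via $s'_n = s_n + (\lambda_n - \lambda'_n) \cdot h_n$ where $\lambda'_n$ is the raised $\lambda_n$. It is
possible to determine in one step how much $\lambda_n$ has to be raised to assure $\delta'_n < 0$:

$$\delta'_n < 0 \;\Leftrightarrow\; \delta_n + (\lambda_n - \lambda'_n) \cdot |h_n|^2 < 0 \;\Leftrightarrow\; \lambda'_n > \lambda_n + \delta_n / |h_n|^2 .$$

Following Møller, we chose $\lambda'_n = 2.0 \cdot (\lambda_n + \delta_n / |h_n|^2)$, i.e., $\delta'_n = -\delta_n - \lambda_n \cdot |h_n|^2$
as new estimate for $\delta_n$.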
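The one-step raise of the scaling parameter can be written down directly from the two formulas above. This is a small sketch with a hypothetical function name (`raise_lambda`); it only reproduces the arithmetic of the update and makes no claim about how the authors' implementation is organized.

```python
def raise_lambda(delta, lam, h):
    """One scaling step: raise lam so that the new delta is negative.

    delta: current value of delta_n, lam: current lambda_n,
    h: current search direction as a list of floats.
    Returns (lam_new, delta_new).
    """
    h_sq = sum(x * x for x in h)                 # |h_n|^2
    lam_new = 2.0 * (lam + delta / h_sq)         # raised lambda
    delta_new = delta + (lam - lam_new) * h_sq   # equals -delta - lam * |h_n|^2
    return lam_new, delta_new
```

For a positive `delta` and a non-negative `lam`, the returned `delta_new` is negative, so a single raise suffices, as stated in the text.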
This scaling mechanism combined with conjugate search directions as done
in the CGEM (4) leads (in principle) to the SCGEM. For the sake of closeness
to SCGs, we used in (5) the approximation of the second order information
originally proposed by Møller. It requires one additional log-likelihood evaluation
compared with the EM. Motivated by (1) and (3), we apply a different
approximation in order to avoid the additional evaluation, namely
$\nabla^2 LL(D,\theta^n) \approx E[\nabla^2 LL(Z \mid \theta) \mid \theta^n, D]$. It turns out that we approximate the Hessian of the
log-likelihood $\nabla^2 LL(D,\theta^n)$ with the Hessian of the expected information score
$\nabla^2 Q(\theta \mid \theta^n, D)$:

$$\begin{aligned}
E\left[\frac{\partial^2 LL(Z \mid \theta)}{\partial \beta_{i'j'k'}\, \partial \beta_{ijk}} \,\middle|\, \theta^n, D\right]
&= E\left[\frac{\partial}{\partial \beta_{i'j'k'}} \Big( c(\mathrm{fa}_i^{jk} \mid D) - \theta_{ijk} \sum_{l} c(\mathrm{fa}_i^{lk} \mid D) \Big) \,\middle|\, \theta^n, D\right] \\
&= \frac{\partial}{\partial \beta_{i'j'k'}} \Big( ec(\mathrm{fa}_i^{jk} \mid D) - \theta_{ijk} \sum_{l} ec(\mathrm{fa}_i^{lk} \mid D) \Big)
 = \frac{\partial^2 Q(\theta \mid \theta^n, D)}{\partial \beta_{i'j'k'}\, \partial \beta_{ijk}} .
\end{aligned} \tag{6}$$



Now, recall that we are working in the reparameterized space, i.e., the $\beta$
parameters are independent. The partial derivatives in (6) are zero when $i' \neq i$, $j' \neq j$,
and $k' \neq k$. Thus, only the diagonal elements of $\nabla^2 Q(\theta \mid \theta^n, D)$ are non-zero.
They are given by

$$\theta_{ijk} \cdot (\theta_{ijk} - 1) \cdot \sum_{l} ec(\mathrm{fa}_i^{lk} \mid D) .$$

Moreover, the computation of $s_n$ becomes linear in the number of parameters,
namely

$$s_n = \mathrm{diag}(\nabla^2 Q(\theta \mid \theta^n, D))^T \cdot h_n + \lambda_n \cdot h_n . \tag{7}$$

Because all involved quantities have already been computed for the gradient,
the approximation does not cost any additional log-likelihood evaluations. Thus,
SCGEM performs as many evaluations as the EM per iteration, namely one.
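As an illustration of why (7) is cheap, the sketch below computes the diagonal entries for one parent configuration and then forms $s_n$. It is a sketch under assumptions: the product of the diagonal with $h_n$ is read here as an element-wise product, which is an interpretation and not confirmed by the text, and the function names are hypothetical.

```python
def diag_hessian_q(theta, ec):
    """Diagonal entries theta_j * (theta_j - 1) * sum_l ec_l for one CPT column.

    theta: current probabilities of the column, ec: matching expected counts.
    """
    total = sum(ec)
    return [t * (t - 1.0) * total for t in theta]

def s_vector(diag, h, lam):
    """s_n from (7), assuming an element-wise product of diag and h."""
    return [d * x + lam * x for d, x in zip(diag, h)]
```

Both functions reuse only the parameters and the expected counts that the gradient computation already needs, so no further inference call is involved.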



6 Experiments 

In the experiments described below, we implemented all three algorithms, i.e., 
EM, CGEM, and SCGEM using Netica API (http://www.norsys.com) for 
Bayesian network inference. We adapted the conjugate gradient described in [20]
to fulfill the constraints described above and to become a CGEM. Based on this
code, we adapted the scaled conjugate gradient as implemented in the Netlab 
library [16] to come up with the SCGEM. Two initial EM steps and a maximum 
of three consecutive scaling steps were set. To avoid zero entries in the associated 
conditional probability tables we used m-estimates (m = 1) [14]. 
Table 1. Description of the networks used in the experiments.

Name           Description
Alarm          Well-known benchmark network for the ICU ventilator management.
               It consists of 37 nodes and 752 parameters [2]. No latent variables.
Insurance      Well-known benchmark network for estimating the expected claim costs
               for a car insurance policyholder. It consists of 27 nodes (12 latent)
               and 1419 parameters [3].
3-1-3, 5-3-5   Two artificial networks with a feed-forward architecture known from
               neural networks [3]. There are three fully connected layers of 3 x 1 x 3
               (resp. 5 x 3 x 5) nodes. Nodes of the first and third layers have 3
               possible states, nodes of the second 2. In total, there are 81
               (resp. 568) parameters.



Data were generated from four Bayesian networks whose main characteristics
are described in Table 1. From each target network, we generated a test set of
10000 data cases, and (independently) training sets of 100, 200, 500 and 1000
data cases, with a fraction of 0, 0.1, 0.2 and 0.3 of the values of the observed
nodes missing at random. The smaller training sets were all subsets of the
corresponding training set of 1000 data cases. The values of latent nodes were never observed. For each training set, 10
different random initial sets of parameters were tried. We ran each algorithm on 
each data set starting from each of the initial sets of parameters. Each algorithm 
stopped when a limit of 200 iterations was exceeded or a change in log-likelihood 
at one iteration relative to the total change so far was smaller than 1 -|- 10“^. 
All learned models were evaluated on the test set using a standard measure, the
normalized-loss $\frac{1}{N} \sum_{l=1}^{N} (\log P^*(d_l) - \log P(d_l))$, where $P^*$ is the probability
of the generating distribution. Normalized-loss measures the additional penalty
for using an approximation instead of the true model, and is also a cross-entropy
estimate. The closer the normalized-loss is to zero, the better.
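The normalized-loss is straightforward to compute once the per-case log-probabilities are available. The helper below is a minimal sketch with a hypothetical name; it assumes the log-probabilities under the generating model and under the learned model have already been obtained from an inference engine.

```python
def normalized_loss(logp_true, logp_model):
    """Average of log P*(d_l) - log P(d_l) over the test cases.

    logp_true:  log-probabilities of the test cases under the generating model.
    logp_model: log-probabilities of the same cases under the learned model.
    """
    if len(logp_true) != len(logp_model):
        raise ValueError("both lists must cover the same test cases")
    n = len(logp_true)
    return sum(t - m for t, m in zip(logp_true, logp_model)) / n
```

A learned model that assigns exactly the generating probabilities yields a normalized-loss of zero.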

To measure the computational costs, we counted the number of log-likelihood
evaluations. This way we do not need to compare CPU times, which depend on
implementation details. Moreover, the evaluations dominate all other costs for a
sufficiently large number $N$ of data cases (assuming unit costs of arithmetic
operations). All three algorithms require the computation of the expected counts,
i.e., a likelihood evaluation. Computing (4) takes $O(M)$ extra work, and
computing (7) takes $O(M \cdot d)$ extra work where $d$ is the maximal number of possible
states of a random variable in a given Bayesian network of $M$ parameters. Thus,
the work needed to be done in each (line search) iteration is approximately the
same. This complies with our experience gained in the experiments.

The CGEM showed the expected behaviour. Considering only the
normalized-loss, it reached values similar to those of the EM. The 95% confidence
interval (EM − CGEM) was [−0.0175, −0.0099], favoring the EM. However, the
CGEM had much higher computational costs than SCGEM: on average, it needed 19 line
search iterations. To be fair, it has been argued that the CGEM does not need a
very precise line search. For instance, Ortiz and Kaelbling [18] set a maximum of
10 line search iterations, and reported that this limit was often exceeded. To
simulate this, we fixed the log-likelihoods originally reached by CGEM, but varied
the averaged number of line search iterations. It turned out that SCGEM would
run significantly faster than the CGEM already with a limit of 3 line search
inferences (the 95% confidence interval CGEM − SCGEM for a limit of 3 would be
[22.42, 27.20], for a limit of 10 it would be [115.98, 128.86])^3. Therefore, we omit
CGEM from further discussion.

Fig. 1. The 95% confidence intervals for differences in number of iterations
(EM − SCGEM). For each number of data cases (x axis) and each percentage of
missing data (different line styles), the confidence intervals are shown. The figures within
figures are close-ups of the corresponding areas. Each table below a figure shows the
corresponding mean number of iterations for EM (upper row) and SCGEM (lower row)
in order of appearance within the figure and rounded to the closest integer value.

EM and SCGEM reached similar normalized-losses, i.e., were equal in quality.
A two-tailed, paired sampled t test over all experiments showed that we cannot
reject the null hypothesis that EM and SCGEM do not differ in average performance
(p = 0.05). There was a very small variation over the different networks. The
confidence intervals (95%) for normalized-losses were: Alarm [−0.00116, −0.00025],
Insurance [−0.00848, +0.00120], 3-1-3 [−0.00559, +0.00014] and 5-3-5
[+0.00591, +0.1620].

^3 Note that this analysis actually favors the CGEM. It is unlikely that the CGEM
would reach the original log-likelihoods within one line search.
Fig. 2. Typical learning curves for EM and SCGEM on Alarm and 5-3-5. The number
of iterations, i.e., the number of log-likelihood evaluations is plotted against the log- 
likelihood achieved. The lower right plot shows the SCGEM getting trapped in scaling. 
The figures within figures are close-ups of the corresponding areas. 



In total, SCGEM was faster than EM. Figure 1 summarizes the difference
(EM − SCGEM) in number of iterations, i.e., number of likelihood evaluations. The
95% confidence intervals are plotted. As can be seen, EM was faster than SCGEM on
Alarm. However, there was only a difference of around 6 iterations. Moreover,
this did not carry over to the other domains. SCGEM tended to be slightly faster
on 3-1-3 (a speed-up of about 5 iterations on average). On Insurance and 5-3-5, the
difference in average iterations increased to a speed-up of about 16 iterations for
Insurance and about 39 iterations for 5-3-5. However, this is averaged over different percentages
of missing data. As Figure 1 shows, there can be speed-ups of more than 80
iterations depending on the percentage of missing information. These results
validate the theory as explained in Section 4: the more missing information, the
higher the speed-up.

Kersting and Landwehr [9] reported that the EM had a faster initial progress
than SCG. Compared to SCGEM, this does not hold any longer.
Figure 2 shows some typical learning curves. For Alarm, the EM typically possessed
a faster initial progress, but e.g. for 5-3-5, the SCGEM typically overtook the
EM after a few iterations. However, in contrast to the EM, the SCGEM sometimes
got trapped in scaling. Surprisingly, this happened most of the time for Alarm
(54% of 136 cases) where SCGEM got no speed-up. The remaining trapped cases
were distributed as follows: 6% on 3-1-3, 16% on Insurance, and 24% on 5-3-5.
Interestingly, SCGEM reached better normalized-losses on 5-3-5 (taking the
'scaling traps' into account).
7 Related Work and Conclusions

Both EM and gradient-based methods are well-known parameter estimation techniques. For a
general introduction see [13,12]. Møller [15] introduced SCG to estimate the
parameters of neural networks. Lauritzen [11] introduced EM for Bayesian networks
(see also Heckerman [6] for a nice tutorial). The original work on gradient-based
approaches in the context of Bayesian networks was done by Binder et al. [3].
However, we are not aware of any application of SCG in order to accelerate the
EM. Bauer et al. [1] reported on experiments with PEM for learning Bayesian
networks. They did not report on results of CGEM. Thiesson [21] discussed
conventional conjugate gradient accelerations of the EM, but did not report on
experiments. Ortiz and Kaelbling [17] conducted experiments with PEM and
CGEM for continuous models, namely density estimation with a mixture of
Gaussians. Their results generally favor CGEM over PEM, although PEM can
be superior for some domains. Finally, there are several acceleration techniques
of the EM which do not apply conjugate gradients and have not received much
attention within Bayesian network learning. We refer to [8,13] for a full
account of them. Among this work, Lange [10] is closest to SCGEM. He proposed
the expected information matrix as approximation of the Hessian together with
a symmetric, rank-one update of the complete approximation matrix within a
quasi-Newton technique. An interesting but orthogonal line of research discussed in [13]
investigates criteria for when to switch to CGEM. This is a promising research
direction for SCGEM.

To conclude, we introduced scaled conjugate gradient EM for maximum
likelihood parameter estimation of Bayesian networks. It overcomes the expensive
line search of traditional conjugate gradient EM. To the best of our knowledge,
this is the first report of an accelerated EM for Bayesian networks
which requires the same number of likelihood evaluations per iteration as the
EM. The experiments show that SCGEM is equal to EM and CGEM in
quality but can be significantly faster than both. As predicted by theory, SCGEM seems
especially well suited in the presence of many latent parameters, i.e., 'EM-hard'
instances. The main future question seems to be the disarming of the scaling
traps.
Abstract. We consider supervised learning of a ranking function, which is a
mapping from instances to total orders over a set of labels (options). The training 
information consists of examples with partial (and possibly inconsistent) infor- 
mation about their associated rankings. From these, we induce a ranking function 
by reducing the original problem to a number of binary classification problems, 
one for each pair of labels. The main objective of this work is to investigate the 
trade-off between the quality of the induced ranking function and the computa- 
tional complexity of the algorithm, both depending on the amount of preference 
information given for each example. To this end, we present theoretical results 
on the complexity of pairwise preference learning, and experimentally investi- 
gate the predictive performance of our method for different types of preference 
information, such as top-ranked labels and complete rankings. The domain of 
this study is the prediction of a rational agent’s ranking of actions in an uncertain 
environment. 



1 Introduction 

The problem of learning with or from preferences has recently received a lot of attention
within the machine learning literature.* The problem is particularly challenging because
it involves the prediction of complex structures, such as weak or partial order relations, 
rather than single values. Moreover, training input will not, as it is usually the case, 
be offered in the form of complete examples but may comprise more general types of 
information, such as relative preferences or different kinds of indirect feedback. 

More specifically, the learning scenario that we will consider in this paper consists 
of a collection of training examples which are associated with a finite set of decision 
alternatives. Following the common notation of supervised learning, we shall refer to 
the latter as labels. However, contrary to standard classification, a training example is 
not assigned a single label, but a set of pairwise preferences between labels, expressing 
that one label is preferred over another. 

The goal is to use these pairwise preferences for predicting a total order, a ranking,
of all possible labels for a new training example. More generally, we seek to induce a
ranking function that maps instances (examples) to rankings over a fixed set of decision
alternatives (labels), in analogy to a classification function that maps instances to single
labels. To this end, we investigate the use of round robin learning or pairwise
classification. As will be seen, round robin appears particularly appealing in this context since
it can be extended from classification to preference learning in a quite natural manner.

* Space restrictions prevent a thorough review of related work in this paper, but we refer the
reader to (Fürnkranz and Hüllermeier, 2003).

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 145-156, 2003.

The paper is organized as follows: In the next section, we introduce the learning 
problem in a formal way. The extension of pairwise classification to pairwise preference 
learning and its application to ranking are discussed in section 3. Section 4 provides 
some results on the computational complexity of pairwise preference learning. Results 
of several experimental studies investigating the predictive performance of our approach 
under various training conditions are presented in section 5. We conclude the paper with 
some final remarks in section 6. 

2 Learning Problem 

We consider the following learning problem: 



Given:

- a set of labels $L = \{\lambda_i \mid i = 1 \ldots c\}$

- a set of examples $E = \{e_k \mid k = 1 \ldots n\}$

- for each training example $e_k$:

  • a set of preferences $P_k \subseteq L \times L$, where $(\lambda_i, \lambda_j) \in P_k$ indicates that label
  $\lambda_i$ is preferred over label $\lambda_j$ for example $e_k$ (written as $\lambda_i \succ_k \lambda_j$)

Find: a function that orders the labels $\lambda_i$, $i = 1 \ldots c$ for any given example.



This setting has been previously introduced as constraint classification by
Har-Peled et al. (2002). As has been pointed out in their work, the above framework is a
generalization of several common learning settings, in particular (see ibidem for a
formal derivation of these and other results)

- ranking: Each training example is associated with a total order of the labels, i.e.,
for each pair of labels $(\lambda_i, \lambda_j)$ either $\lambda_i \succ_k \lambda_j$ or $\lambda_j \succ_k \lambda_i$ holds.

- classification: A single class label $\lambda_i$ is assigned to each example. This implicitly
defines the set of preferences $\{\lambda_i \succ_k \lambda_j \mid 1 \le j \ne i \le c\}$.

- multi-label classification: Each training example is associated with a subset
$S_k \subseteq L$ of possible labels. This implicitly defines the set of preferences
$\{\lambda_i \succ_k \lambda_j \mid \lambda_i \in S_k, \lambda_j \in L \setminus S_k\}$.
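The three special cases can be made concrete by expanding each kind of label information into its preference set. The following sketch uses hypothetical helper names and represents a preference "a is preferred over b" as the tuple `(a, b)`; it is an illustration of the definitions above, not code from the paper.

```python
def ranking_preferences(order):
    """Total order (best label first) -> all c(c-1)/2 pairwise preferences."""
    return {(order[i], order[j])
            for i in range(len(order))
            for j in range(i + 1, len(order))}

def classification_preferences(label, labels):
    """Single class label -> it is preferred over each of the other c-1 labels."""
    return {(label, other) for other in labels if other != label}

def multilabel_preferences(relevant, labels):
    """Subset S_k of labels -> each member is preferred over each non-member."""
    return {(a, b) for a in relevant for b in labels if b not in relevant}
```

With four labels, a complete ranking yields 6 preferences, a single class label yields 3, and a relevant subset of size 2 yields 4.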

As pointed out before, we will be interested in predicting a ranking (total order)
of the labels. Thus, we assume that for each instance, there exists a total order of the
labels, i.e., the pairwise preferences form a transitive and asymmetric relation. For many
practical applications, this assumption appears to be acceptable at least for the true
preferences. Still, more often than not the observed or revealed preferences will be
incomplete or inconsistent. Therefore, we do not require the data to be consistent in
the sense that transitivity and asymmetry applies to the $P_k$. We only assume that $P_k$ is
irreflexive ($\lambda_i \not\succ \lambda_i$) and anti-symmetric ($\lambda_i \succ \lambda_j \Rightarrow \lambda_j \not\succ \lambda_i$). (Note that $0 \le |P_k| \le c(c-1)/2$
as a consequence of the last two properties.)
3 Pairwise Preference Ranking

A key idea of our approach is to learn a separate theory for each of the $c(c-1)/2$
pairwise preferences between two labels. More formally, for each possible pair of labels
$(\lambda_i, \lambda_j)$, $1 \le i < j \le c$, we learn a model $M_{ij}$ that decides for any given example
whether $\lambda_i \succ \lambda_j$ or $\lambda_j \succ \lambda_i$ holds. The model is trained with all examples $e_k$ for which
either $\lambda_i \succ_k \lambda_j$ or $\lambda_j \succ_k \lambda_i$ is known. All examples for which nothing is known about
the preference between $\lambda_i$ and $\lambda_j$ are ignored.

At classification time, an example is submitted to all $c(c-1)/2$ theories. If classifier
$M_{ij}$ predicts $\lambda_i \succ \lambda_j$, we count this as a vote for $\lambda_i$. Conversely, the prediction $\lambda_j \succ \lambda_i$
would be considered as a vote for $\lambda_j$. The labels are ranked according to the number of
votes they receive from all models $M_{ij}$. Ties are first broken according to the frequency
of the labels in the top rank (the class distribution in the classification setting) and then
randomly.
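The voting scheme just described can be sketched in a few lines. The function name, the `prefers` callback standing in for the trained models $M_{ij}$, and the `top_rank_frequency` dictionary are all hypothetical. One deliberate simplification: remaining ties are resolved by the input order of the labels rather than randomly, so that the sketch is deterministic.

```python
from itertools import combinations

def round_robin_rank(labels, prefers, top_rank_frequency=None):
    """Rank labels by the votes of all pairwise models.

    prefers(a, b) returns True if the model for the pair (a, b) predicts
    that a is preferred over b, and False otherwise.
    top_rank_frequency optionally maps a label to how often it was top-ranked
    in the training data; it is used as the first tie-breaker.
    """
    votes = {label: 0 for label in labels}
    for a, b in combinations(labels, 2):      # all c(c-1)/2 label pairs
        if prefers(a, b):
            votes[a] += 1
        else:
            votes[b] += 1
    freq = top_rank_frequency or {}
    # sorted() is stable, so labels that are still tied keep their input order
    return sorted(labels, key=lambda l: (-votes[l], -freq.get(l, 0)))
```

Note that the pairwise predictions need not be transitive; counting votes always yields a total order even when the individual models contradict each other.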

We refer to the above technique as pairwise preference ranking or round robin
ranking. It is a straightforward generalization of pairwise or one-against-one classification,
aka round robin learning, which solves multi-class problems by learning a separate
theory for each pair of classes. In previous work, Fürnkranz (2002) showed that, for
rule learning algorithms, this technique is preferable to the more commonly used
one-against-all classification method, which learns one theory for each class, using the
examples of this class as positive examples and all others as negative examples.
Interestingly, despite its complexity being quadratic in the number of classes, the algorithm is
no slower than the conventional one-against-all technique (Fürnkranz, 2002). We will
generalize these results in the next section.



4 Complexity 

Consider a learning problem with n training examples and c labels. 

Theorem 1. The total number of training examples over all $c(c-1)/2$ binary
preference learning problems is

$$\sum_{k=1}^{n} |P_k| \;\le\; n \max_k |P_k| \;\le\; n\,\frac{c(c-1)}{2} .$$

Proof. Each of the $n$ training examples $e_k$ will be added to all $|P_k|$ binary training sets
that correspond to one of its preferences $\lambda_i \succ_k \lambda_j$. Thus, the total number of training
examples is $\sum_{k=1}^{n} |P_k|$. As the number of preferences for each example is bounded
from above by $\max_k |P_k|$, this number is no larger than $n \max_k |P_k|$, which in turn is
bounded from above by the size of a complete set of preferences $n c(c-1)/2$. □

Corollary 1. (Fürnkranz, 2002) For a classification problem, the total number of
training examples is only linear in the number of classes.

Proof. A class label expands to $c - 1$ preferences, therefore $\sum_{k=1}^{n} |P_k| = (c-1)n$. □
Note that we only considered the number of training examples, but not the
complexity of the learner that runs on these examples. For an algorithm with a linear run-time
complexity $O(n)$ it follows immediately that the total run-time is $O(dn)$, where $d$ is the
maximum (or average) number of preferences given for each training example. For a
learner with a super-linear complexity $O(n^a)$, $a > 1$, the total run-time is much lower
than $O((dn)^a)$ because the training effort is not spent on one large training set, but on
many small training sets. In particular, for a complete preference set, the total
complexity is $O(c^2 n^a)$, whereas the complexity for $d = c - 1$ (round robin classification) is
only $O(c n^a)$ (Fürnkranz, 2002).
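The counting argument behind Theorem 1 and Corollary 1 can be reproduced by actually building the binary training sets. The sketch below uses a hypothetical function name and encodes a preference as a `(winner, loser)` tuple; the binary target encoding (1 if the alphabetically first label of the pair wins) is an arbitrary choice made for illustration.

```python
from collections import defaultdict

def build_pairwise_training_sets(examples):
    """Distribute examples over one binary training set per label pair.

    examples: list of (x, preferences), where preferences is a set of
    (winner, loser) tuples. Returns a dict mapping a sorted label pair
    to a list of (x, target) entries.
    """
    sets = defaultdict(list)
    for x, prefs in examples:
        for winner, loser in prefs:
            pair = tuple(sorted((winner, loser)))
            target = 1 if winner == pair[0] else 0
            sets[pair].append((x, target))   # one entry per known preference
    return sets

def total_training_examples(sets):
    """Sum of the sizes of all binary training sets, i.e. sum_k |P_k|."""
    return sum(len(entries) for entries in sets.values())
```

For classification data each example contributes $c - 1$ entries, and for complete rankings each contributes $c(c-1)/2$, which matches the bounds stated above.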

For comparison, the only other technique for learning in this setting that we know
of (Har-Peled et al., 2002) constructs twice as many training examples (one positive and
one negative for each preference of each example), and these examples are projected
into a space that has $c$ times as many attributes as the original space. Moreover, all
examples are put into a single training set for which a separating hyper-plane has to
be found. Thus, under the (reasonable) assumption that an increase in the number of
features has approximately the same effect as a corresponding increase in the number
of examples, the total complexity becomes $O((cdn)^a)$ if the algorithm for finding the
separating hyper-plane has complexity $O(n^a)$ for a two-class training set of size $n$.

In summary, the overall complexity of pairwise constraint classification depends on
the number of known preferences for each training example. While being quadratic in
the number of labels if a complete ranking is given, it is only linear for the
classification setting. In any case, it is more efficient than the technique proposed by Har-Peled
et al. (2002). However, it should be noted that the price to pay is the large number of
classifiers that have to be stored and tested at classification time.



5 Empirical Results 

The previous sections have shown that round robin learning can be extended to induce a 
ranking function from a set of preferences instead of a single label. Yet, it turned out that 
computational complexity might become an issue. Especially, since a ranking induces 
a quadratic number of pairwise preferences, the complexity for round robin ranking be- 
comes quadratic in the number of labels. In this context, one might ask whether it could 
be possible to improve efficiency at the cost of a tolerable decrease in performance: 
Could the learning process perhaps ignore some of the preferences without decreasing 
predictive accuracy too much? Apart from that, incomplete training data is clearly a 
point of practical relevance, since complete rankings will rarely be observable. 

The experimental evaluation presented in this section is meant to investigate issues 
related to incomplete training data in more detail, especially to increase our under- 
standing about the trade-off between the number of pairwise preferences available in 
the training data and the quality of the learned ranking function. For a systematic in- 
vestigation of questions of such kind, we need data for which, in principle, a complete 
ranking is known for each example. This information allows a systematic variation of 
the amount of preference information in the training data, and a precise evaluation of 
the predicted rankings on the test data. Since we are not aware of any suitable real-world 
datasets, we have conducted our experiments with synthetic data. 
5.1 Synthetic Data

We consider the problem of learning the ranking function of an expected utility maximizing
agent. More specifically, we proceed from a standard setting of expected utility
theory: $A = \{a_1, \ldots, a_c\}$ is a set of actions the agent can choose from and
$\Omega = \{\omega_1, \ldots, \omega_m\}$ is a set of world states. The agent faces a problem of decision under risk
where decision consequences are lotteries: Choosing act $a_i$ in state $\omega_j$ yields a utility
of $u_{ij} \in \mathbb{R}$, where the probability of state $\omega_j$ is $p_j$. Thus, the expected utility of act $a_i$
is given by

$$E(a_i) = \sum_{j=1}^{m} p_j \cdot u_{ij}. \qquad (1)$$

Expected utility theory justifies (1) as a criterion for ranking actions and, hence, gives 
rise to the following preference relation: 

$$a_i \succ a_j \;\Leftrightarrow\; E(a_i) > E(a_j). \qquad (2)$$

Now, suppose the probability vector $p = (p_1, \ldots, p_m)$ to be a parameter of the decision
problem (while $A$, $\Omega$ and the utility matrix $U = (u_{ij})$ are fixed).

The above decision-theoretic setting can be used for generating synthetic data for 
preference learning. The set of instances corresponds to the set of probability vectors $p$,
which are generated at random according to a uniform distribution over
$\{p \in \mathbb{R}^m \mid p \geq 0,\ p_1 + \ldots + p_m = 1\}$. The ranking function associated with an example is given by the
ranking defined in (2). Thus, an experiment is characterized by the following parameters:
the number of actions/labels ($c$), the number of world states ($m$), the number of
examples ($n$), and the utility matrix, which is generated at random through independent
and uniformly distributed entries $u_{ij} \in [0, 1]$.
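
The data generation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, and the uniform distribution over the probability simplex is sampled by normalizing independent exponential draws, which is one standard way to obtain it.

```python
import random

def sample_simplex(m, rng):
    """Draw p uniformly from {p in R^m : p >= 0, p_1 + ... + p_m = 1}."""
    # Normalized i.i.d. exponential draws are uniform on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]

def random_utility_matrix(c, m, rng):
    """c x m utility matrix with independent entries uniform on [0, 1)."""
    return [[rng.random() for _ in range(m)] for _ in range(c)]

def expected_utilities(U, p):
    """E(a_i) = sum_j p_j * u_ij, as in equation (1)."""
    return [sum(pj * uij for pj, uij in zip(p, row)) for row in U]

def ranking(U, p):
    """Action indices sorted by decreasing expected utility, as in (2)."""
    e = expected_utilities(U, p)
    return sorted(range(len(U)), key=lambda i: -e[i])

def make_dataset(c, m, n, seed=0):
    """One fixed utility matrix U and n examples (p, complete ranking)."""
    rng = random.Random(seed)
    U = random_utility_matrix(c, m, rng)
    examples = []
    for _ in range(n):
        p = sample_simplex(m, rng)
        examples.append((p, ranking(U, p)))
    return U, examples
```

Each example pairs an instance (the probability vector) with the complete ranking of the c actions, from which any subset of pairwise preferences can later be derived.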

5.2 Experimental Setup 

In the following, we will report on results of experiments with ten different states ($m = 10$)
and various numbers of labels ($c = 5, 10, 20$). For each of the three configurations
we generated ten different data sets, each one originating from a different randomly
chosen utility matrix $U$. The data sets consisted of 1000 training and 1000 test examples.
For each example, the data sets provided the probability vector $p \in \mathbb{R}^m$ and a complete
ranking of the $c$ possible actions^1. The training examples were labeled with a subset
of the complete set of pairwise preferences as imposed by the ranking in the data set.
The subsets that were selected are described below, separately for each experiment.

We used the decision tree learner C4.5 (Quinlan, 1993) in its default settings^2 to
learn a model for each pairwise preference. For all examples in the test set we ob- 
tained a final ranking using simple voting and tie breaking as described in section 3. 

^1 The occurrence of actions with equal expected utility has probability 0.

^2 Our choice of C4.5 as the learner was solely based on its versatility and wide availability. If
we had aimed at maximizing performance on this particular problem, we would have chosen
an algorithm that can directly represent the separating hyperplanes for each binary preference.

The predicted ranks were then compared with the actual ranks. Our primary evaluation
measures were the error rate of the top rank (for comparing classifications) and the
Spearman rank correlation coefficient (for comparing complete rankings). 
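
The prediction and evaluation steps can be illustrated as follows. This is a minimal sketch with hypothetical names: `prefers(i, j)` stands in for a learned pairwise model $M_{ij}$, and ties in the vote counts are broken by label index, which is an assumption and not necessarily the tie-breaking scheme of section 3. The Spearman coefficient uses the standard closed form for rankings without ties.

```python
from itertools import combinations

def vote_ranking(prefers, c):
    """Rank c labels by the number of pairwise comparisons they win.

    prefers(i, j) is a hypothetical stand-in for the model M_ij and returns
    True if label i is predicted to be preferred to label j.
    """
    votes = [0] * c
    for i, j in combinations(range(c), 2):
        if prefers(i, j):
            votes[i] += 1
        else:
            votes[j] += 1
    # Assumption: ties are broken by label index.
    return sorted(range(c), key=lambda k: (-votes[k], k))

def spearman(true_ranking, predicted_ranking):
    """Spearman rank correlation of two tie-free rankings of the same labels."""
    c = len(true_ranking)
    pos_true = {label: r for r, label in enumerate(true_ranking)}
    pos_pred = {label: r for r, label in enumerate(predicted_ranking)}
    d2 = sum((pos_true[k] - pos_pred[k]) ** 2 for k in pos_true)
    return 1.0 - 6.0 * d2 / (c * (c * c - 1))

def top_rank_error(true_rankings, predicted_rankings):
    """Fraction of examples whose predicted top label is wrong."""
    wrong = sum(t[0] != p[0] for t, p in zip(true_rankings, predicted_rankings))
    return wrong / len(true_rankings)
```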

5.3 Ranking vs. Classification 

Figure 1 shows experimental results for (a) using the full set of $c(c-1)/2$ pairwise
preferences, (b) the classification setting, which uses only the $c-1$ preferences that
involve the top label, and (c) the complementary setting with the $(c-1)(c-2)/2$
preferences that do not involve the top label. There are several interesting things to
note for these results. First, the difference between the error rates of the classification
and the ranking setting is comparatively small. Thus, if we are only interested in the top
rank, it may often suffice to use the pairwise preferences that involve the top label. The 
advantage in this case is of course the reduced complexity which becomes linear in the 
number of labels. On the other hand, the results also show that the complete ranking 
information can be used to improve classification accuracy, at least if this information 
is available for each training example and if one is willing to pay the price of a quadratic 
complexity. 
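
The three preference sets compared here can be derived from a complete ranking as sketched below (function names are hypothetical). A ranking is a list of labels from most to least preferred, and a preference is a pair (a, b) meaning that a is preferred to b.

```python
from itertools import combinations

def full_preferences(ranking):
    """All c(c-1)/2 pairs (a, b) with a ranked above b."""
    # combinations preserves the input order, so a always precedes b.
    return list(combinations(ranking, 2))

def classification_preferences(ranking):
    """The c-1 preferences that involve the top label."""
    top = ranking[0]
    return [(top, b) for b in ranking[1:]]

def complement_preferences(ranking):
    """The (c-1)(c-2)/2 preferences that do not involve the top label."""
    return full_preferences(ranking[1:])
```

The classification and complementary sets are disjoint and together make up the full set, so the sizes add up: (c - 1) + (c - 1)(c - 2)/2 = c(c - 1)/2.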

The results for the complementary setting show that the information of the top rank 
preferences is crucial: When dropping this information and using only those pairwise 
preferences that do not involve the top label, the error rate on the top rank increases 
considerably, and is much higher than the error rate for the classification setting. This 
is a bit surprising if we consider that in the classification setting, the average number 
of training examples for learning a model is much smaller than in the complemen- 
tary setting. Interestingly, the effective number of training examples for the top labels 
might nevertheless decrease. In fact, in our learning scenario we will often have a few 
dominating actions whose utility degrees are systematically larger than those of other 
actions. In the worst case, the same action is optimal for all probability vectors p, and 
the complementary set will not contain any information about it. While this situation 
is of course rather extreme, the class distribution is indeed very unbalanced in our scenario.
For example, we determined experimentally for $c = m = 10$ and $n = 1000$ that
the probability of having the same optimal action for more than half of the examples is
$\approx 2/3$, and that the expected Gini-index of the class distribution is $\approx 1/2$.

With respect to the prediction of complete rankings, the performance for learning 
from the complementary set of preferences is almost as good as the performance for 
learning from the complete set of preferences, whereas the performance of the ranking 
induced from the classification setting is considerably worse. This time, however, the 
result is hardly surprising and can easily be explained by the amount of information 
provided in the two cases. In fact, the complementary set determines the ranking of $c-1$
among the $c$ labels, whereas the top label alone hardly provides any information
about the complete ranking.

As another interesting finding note that the classification accuracy decreases with 
an increasing number of labels, whereas the rank correlation increases (this is also re- 
vealed by the curves in Figure 3 below). In other words, the quality of the predicted 
rankings increases, even though the quality of the predictions for the individual ranks 
decreases. This effect can first of all be explained by the fact that the (classification) 
 c   prefs            error               rank corr.

 5   ranking          13.380 ± 8.016      0.907 ± 0.038
     classification   14.400 ± 8.262      0.783 ± 0.145
     complement       32.650 ± 14.615     0.872 ± 0.051

10   ranking          15.820 ± 8.506      0.940 ± 0.018
     classification   16.670 ± 9.549      0.711 ± 0.108
     complement       24.310 ± 9.995      0.937 ± 0.018

20   ranking          24.030 ± 4.251      0.966 ± 0.004
     classification   26.370 ± 5.147      0.697 ± 0.066
     complement       32.300 ± 3.264      0.966 ± 0.004



Fig. 1. Comparison of ranking (a complete set 
of preferences is given) vs. classification (only 
the preferences for the top rank are given). 
Also shown are the results for the complemen- 
tary setting (all preferences for the top rank are 
omitted). 




Fig. 2. Expected Spearman rank correlation as
a function of the number of labels if all models
$M_{ij}$ have an error rate of $\epsilon$ (curves are shown
for $\epsilon$ = 0.1, 0.2, 0.3, 0.4, 0.5).



error is much more affected by an increase of the number of labels. As an illustration,
consider random guessing: The chances of guessing the top label correctly are $1/c$,
whereas the expected value of the rank correlation is 0 regardless of $c$. Moreover, one
might speculate that the importance of a correct vote of each individual model $M_{ij}$
decreases with an increasing number of labels. Roughly speaking, incorrect classifications
of individual learners are better compensated on average. This conjecture is also
supported by an independent experiment in which we simulated a set of homogeneous
models $M_{ij}$ through biased coin flipping with a prespecified error rate. It turned out
that the quality measures for predicted rankings tend to increase if the number of labels
becomes large (though the dependence of the measures on the number of labels is not
necessarily monotone, see Fig. 2).
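
A simulation of this kind can be sketched as follows. It is an illustration under assumptions, not the original experiment: every pairwise model votes incorrectly with the same probability `eps`, independently of the others, ties in the vote counts are broken at random, and the function name and parameters are hypothetical.

```python
import random
from itertools import combinations

def simulate_rank_correlation(c, eps, trials, seed=0):
    """Average Spearman correlation when each of the c(c-1)/2 pairwise
    votes is wrong independently with probability eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Without loss of generality the true ranking is 0, 1, ..., c-1,
        # so label i is truly preferred to label j whenever i < j.
        votes = [0] * c
        for i, j in combinations(range(c), 2):
            correct = rng.random() >= eps
            votes[i if correct else j] += 1
        # Assumption: ties between vote counts are broken at random.
        order = sorted(range(c), key=lambda k: (-votes[k], rng.random()))
        # The true position of a label equals the label itself.
        d2 = sum((pos - label) ** 2 for pos, label in enumerate(order))
        total += 1.0 - 6.0 * d2 / (c * (c * c - 1))
    return total / trials
```

Calling the function for a range of values of `c` at a fixed `eps` gives the kind of curve that Fig. 2 describes; no particular numerical outcome is claimed here.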



5.4 Missing Preferences 

While the previous results shed some light on the trade-off between utility and costs 
for two special types of preference information, namely top-ranked labels and complete 
rankings, they do not give a satisfactory answer for the general case. The selected set of 
preferences in the classification setting is strongly focused on a particular label for each 
example, thus resulting in a very biased distribution. In the following, we will look at
the quality of predicted rankings when selecting random subsets of pairwise preferences
from the full sets, with each preference being equally likely to be selected.

Figure 3 shows the curves for the classification error in the top rank and the average 
Spearman rank correlation of the predicted and the true ranking over the number of 
preferences. To generate these curves, we started with the full set of preferences, and 
ignored increasingly larger fractions of it. This was implemented with a parameter $p_i$
that caused any given preference in the training data to be ignored with probability $p_i$
($100 \times p_i$ is plotted on the x-axis).
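
The filtering step amounts to an independent coin flip per preference. A minimal sketch, with a hypothetical function name and preferences represented as pairs:

```python
import random

def ignore_preferences(preferences, p_i, rng):
    """Drop each preference independently with probability p_i."""
    # rng.random() lies in [0, 1), so p_i = 0 keeps everything
    # and p_i = 1 drops everything.
    return [pref for pref in preferences if rng.random() >= p_i]
```

For example, `ignore_preferences(prefs, 0.5, random.Random(0))` keeps roughly half of `prefs`.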
Fig. 3. Average error rate (left) and Spearman rank correlation (right) for various percentages of
ignored preferences. The error bars indicate the standard deviations. The vertical dotted lines on
the right indicate the number of preferences for classification problems (for 5, 10, and 20 classes),
those on the left are the complementary sizes.



The similar shape of the three curves (for 5, 10, and 20 labels) suggests that the 
decrease in the ranking quality can be attributed solely to the missing preferences while 
it seems to be independent of the number of labels. In particular, one is inclined to 
conclude that — contrary to the case where we focused on the top rank — it is in general 
not possible to reduce the number of training preferences by an order of magnitude 
(i.e., from quadratic to linear in the number of labels) without severely decreasing the 
ranking quality. This can also be seen from the three dotted vertical lines in the right 
half of the graphs. These lines indicate the percentage of preferences that were present 
in the classification setting for 5, 10, and 20 labels (from inner-most to outer-most). A 
comparison of the error rates, given by the intersection of a line with the corresponding 
curve, to the respective error rates in Figure 1 shows an extreme difference between 
the coincidental selection of pairwise preferences and the systematic selection which is 
focused on the top rank. 

Nevertheless, one can also see that about half of the preferences can be ignored 
while still maintaining a reasonable performance level. Even though it is quite com- 
mon that learning curves are concave functions of the size of the training set, the de- 
scent in accuracy appears to be remarkably flat in our case. One might be tempted 
to attribute this to the redundancy of the pairwise preferences induced by a ranking: 
In principle, a ranking $\rho$ could already be reconstructed from the $c-1$ preferences
$\rho_1 \succ \rho_2, \ldots, \rho_{c-1} \succ \rho_c$, which means that only a small fraction of the pairwise
preferences are actually needed. Still, one should be careful with this explanation. First, we
are not trying to reconstruct a single ranking but rather to solve a slightly different problem,
namely to learn a ranking function. Second, our learning algorithm does not actually
“reconstruct” a ranking as suggested above. In fact, our simple voting procedure
does not take the dependencies between individual models $M_{ij}$ into account, which
means that these models do not really cooperate. On the contrary, what the voting procedure
exploits is just the redundancy of preference information: The top rank is the
winner only because it is preferred in $c-1$ out of the $c(c-1)/2$ pairwise comparisons.

Finally, note that the shape of the curves probably also depends on the number of 
training examples. We have not yet investigated this issue because we were mainly 
Fig. 4. Average Spearman rank correlation over various percentages of random preferences. The
error bars indicate the standard deviations. The solid thin lines are the curves for ignored
preferences (Figure 3).



interested in the possibility of reducing the complexity by more than a constant factor 
without losing too much of predictive accuracy. It would be interesting, for example, to 
compare (a) using p% of the training examples with full preferences and (b) using all 
training examples with p% of the pairwise preferences. 



5.5 Mislabeled Preferences 

Recall that our learning scenario assumes preference structures to be complete rankings
of labels, that is, transitive and asymmetric relations. As already pointed out, we do
not make this assumption for observed preferences: First, we may not have access to 
complete sets of preferences (the case studied in the previous section). Second, the pro- 
cess generating the preferences might reproduce the underlying total order incorrectly 
and, hence, produce inconsistent preferences. The latter problem is quite common, for 
example, in the case of human judgments. 

To simulate this behavior, we adopted the following model: Proceeding from the
pairwise preferences induced by a given ranking, a preference $\lambda_i \succ \lambda_j$ was kept with
probability $1 - p_s$, whereas with probability $p_s$, one of the preferences $\lambda_i \succ \lambda_j$ and
$\lambda_j \succ \lambda_i$ was selected by a coin flip. Thus, in a fraction of approximately $p_s/2$ of the
cases, the preference will point in the wrong direction^3. For $p_s = 0$, the data remain unchanged, whereas
the preferences in the training data are completely random for $p_s = 1$.
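
The noise model can be written down directly. The sketch below uses a hypothetical function name and represents a preference as a pair (a, b), meaning that a is preferred to b; it follows the two-stage description (re-assign with probability $p_s$, then flip a fair coin), not the shortcut mentioned in the footnote.

```python
import random

def perturb_preferences(preferences, p_s, rng):
    """Keep each preference with probability 1 - p_s; otherwise pick one
    of the two directions by a fair coin flip."""
    noisy = []
    for a, b in preferences:
        reassigned = rng.random() < p_s
        if reassigned and rng.random() < 0.5:
            noisy.append((b, a))  # wrong direction, probability p_s / 2
        else:
            noisy.append((a, b))  # original direction
    return noisy
```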

Figure 4 shows the average Spearman rank correlations that were observed in this 
experiment. Note that the shape of the curve is almost the same as the shape of the 
curves for ignored preferences. It is possible to directly compare these two curves because
in both graphs a level of $n$% means that $(100 - n)$% of the preferences are still
intact. The main difference is that in Figure 3, the remaining n% of the preferences 
have been ignored, while in Figure 4 they have been re-assigned at random. To facilitate
this comparison, we plotted the curves for ignored preferences (the same ones as in
Figure 3) into the graph (with solid, thin lines).

^3 In fact, we implemented the procedure by selecting a fraction $p_s/2$ of the preferences and reversing their sign.

It is interesting to see that in both cases the performance degrades very slowly at 
the beginning, albeit somewhat steeper than if the examples are completely ignored. 
Roughly speaking, completely omitting a pairwise preference appears to be better than 
including a random preference. This could reasonably be explained by the learning
behavior of a classifier $M_{ij}$: If $M_{ij}$ already performs well, an additional correct
example will probably be classified correctly and thus improve $M_{ij}$ only slightly (in
decision tree induction, for example, $M_{ij}$ will even remain completely unchanged if
the new example is classified correctly). As opposed to this, an incorrect example will
probably be classified incorrectly and thus produce a more far-reaching modification
of $M_{ij}$ (in decision tree induction, an erroneous example might produce a completely
different tree). All in all, the “expected benefit” of $M_{ij}$ caused by a random preference
is negative, whereas it is 0 if the preference is simply ignored.

From this consideration one may conclude that a pairwise preference should better 
be ignored if it is no more confident than a coin flip. This can also be grasped intuitively, 
since the preference does not provide any information in this case. If it is more confi- 
dent, however, it clearly carries some information and it might then be better to include 
it, even though the best course of action will still depend on the number and reliability of
the preferences already available. Note that our experiments do not suggest any strat- 
egy for deciding whether or not to include an individual preference, given information 
about the uncertainty of that preference. In our case, each preference is equally uncer- 
tain. Thus, the only reasonable strategies are to include all of them or to ignore the 
complete sample. Of course, the first strategy will be better as soon as the probability 
of correctness exceeds 1/2, and this is also confirmed by the experimental results. For 
example, the correlation coefficient remains visibly above 0.8 even if 80% of the pref- 
erences are assigned by chance and, hence, the probability of a particular preference to 
be correct is only 0.6. One may conjecture that pairwise preference ranking is partic- 
ularly robust toward noise, since an erroneous example affects only a single classifier 
$M_{ij}$, which in turn has a limited influence on the eventually predicted ranking.
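
The figure of 0.6 quoted above follows directly from the noise model of this section: a preference is correct if it is kept, or if it is re-assigned and the coin flip happens to pick the right direction.

```latex
\Pr(\text{preference correct})
  = (1 - p_s) + \tfrac{1}{2}\, p_s
  = 1 - \tfrac{p_s}{2},
\qquad
p_s = 0.8 \;\Rightarrow\; 1 - 0.4 = 0.6 .
```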



6 Concluding Remarks 

We have introduced pairwise preference learning as an extension of pairwise classi- 
fication to constraint classification, a learning scenario where training examples are 
labeled with a preference relation over all possible labels instead of a single class label 
as in the conventional classification setting. From this information, we also learn one 
model for each pair of classes, but focus on learning a complete ranking of all labels 
instead of only predicting the most likely label. Our main interest was to investigate 
the trade-off between ranking quality and the amount of training information (in terms 
of the number of preferences that are available for each example). We experimentally 
investigated this trade-off by varying parameters of a synthetic domain that simulates 
a decision-theoretic agent which ranks its possible actions according to an unknown 
utility function. Roughly speaking, the results show that large parts of the information 
about pairwise preferences can be ignored in round robin ranking without losing too 
much predictive performance. In the classification setting, where one is only interested
in predicting the top label, it also turned out that using the full ranking information
rather than restricting to the pairwise preferences involving the top label even improves
the classification accuracy, suggesting that the lower ranks do contain valuable
information. For reasons of efficiency, however, it might still be advisable to concen- 
trate on the smaller set of preferences, thereby reducing the size of the training set by 
an order of magnitude. 

The main limitation of our technique is probably the assumption of having enough 
training examples for learning each pairwise preference. For data with a very large 
number of labels and a rather small set of preferences per example, our technique will 
hardly be applicable. In particular, it is unlikely to be successful in collaborative filtering
problems (Goldberg et al., 1992; Resnick and Varian, 1997; Breese et al., 1998),
although these can be mapped onto the constraint classification framework in a straight- 
forward way. A further limitation is the quadratic number of theories that has to be 
stored in memory and evaluated at classification time. However, the increase in mem- 
ory requirements is balanced by an increase in computational efficiency in comparison 
to the technique of Har-Peled et al. (2002). In addition, pairwise preference learning in- 
herits many advantages of pairwise classification, in particular its implementation can 
easily be parallelized because of its reduction to independent subproblems. Finally, we 
have assumed an underlying total order of the items which needs to be recovered from 
partial observations of preferences. However, partial orders (cases where several labels 
are equally preferred) may also occur in practical applications. We have not yet inves- 
tigated the issue of how to generate (and evaluate) partial orders from learned pairwise 
predictions. Similarly, our current framework does not provide a facility for discrimi- 
nating between cases where we know that a pair of labels is of equal preference and 
cases where we don’t know anything about their relative preferences. 

There are several directions for future work. First of all, it is likely that the predic- 
tion of rankings can be improved by combining the individual models’ votes in a more 
sophisticated way. Several authors have looked at techniques for combining the predic- 
tions of pairwise theories into a final ranking of the available options. Proposals include 
weighting the predicted preferences with the classifiers’ confidences (Fürnkranz, 2003)
or using an iterative algorithm for combining pairwise probability estimates (Hastie and 
Tibshirani, 1998). However, none of the previous works have evaluated their techniques 
in a ranking context, and some more elaborate proposals, like error-correcting output 
decoding (Allwein et al., 2000), organizing the pairwise classifiers in a tree-like structure
(Platt et al., 2000), or using a stacked classifier (Savicky and Fürnkranz, 2003) are
specifically tailored to a classification setting. Taking into account the fact that we are 
explicitly seeking a ranking could lead to promising alternatives. For example, we are 
thinking about selecting the ranking which minimizes the number of predicted prefer- 
ences that need to be reversed in order to make the predicted relation transitive. Depart- 
ing from the counting of votes might also offer possibilities for extending our method 
to the prediction of preference structures more general than rankings (total orders), 
such as weak preference relations where some of the labels might not be comparable. 
Apart from theoretical considerations, an important aspect of future work concerns the 
practical application of our method and its evaluation using real-world problems. Unfortunately,
real-world data sets that fit our framework seem to be quite rare. In fact,
currently we are not aware of any data set of significant size that provides instances in 
attribute-value representation plus an associated complete ranking over a limited num- 
ber of labels. 
Abstract. We present in this paper a method to introduce a priori
knowledge into reinforcement learning using temporally extended ac- 
tions. The aim of our work is to reduce the learning time of the 
Q-learning algorithm. This introduction of initial knowledge is done by 
constraining the set of available actions in some states. But at the same 
time, we can formulate that if the agent is in some particular states 
(called exception states), we have to relax those constraints. We define a 
mechanism called the propagation mechanism to get out of blocked situ- 
ations induced by the initial knowledge constraints. We give some formal 
properties of our method and test it on a complex grid-world task. On 
this task, we compare our method with Q-learning and show that the 
learning time is drastically reduced for a very simple initial knowledge 
which would not be sufficient, by itself, to solve the task without the 
definition of exception situations and the propagation mechanism. 



1 Introduction 

Reinforcement Learning is a general framework in which an autonomous 
agent learns which actions to choose in particular situations (states) in order to 
optimize some reinforcements (rewards or punishments) in the long run [1]. A fundamental
problem of its standard algorithms is that although many tasks can be
formulated in this framework, in practice they are not solvable in reasonable
time for large state spaces. There are two principal approaches for addressing these
problems: The first approach is to apply generalization techniques (e.g., [2,3]). 
The second approach is to use temporally extended actions (e.g., [4,5, 6, 7, 8, 9]). 
A temporally extended action is a way of grouping actions to create a new one. 
For example, if the primitive actions of a problem are “make a step in a given 
direction” , a temporally extended action could be “make ten steps to the north 
followed by two steps to the west”. Temporally extended actions represent the 
problem at different levels of abstraction. 

The aim of our work is to give a method for easily incorporating some a priori
knowledge about a task we try to solve by reinforcement learning, in order to speed
up learning. To introduce knowledge into reinforcement learning, we
use some temporally extended actions for which the set of available actions can 
change during learning. We try to reduce the blind exploration of the agent by 
constraining the set of available actions. But because the a priori knowledge 
could be very simple, those constraints can make the agent unable to solve a
task. So we define a way to relax those constraints (with what we call the exception
conditions and the propagation mechanism). The structure of this paper
is as follows. First we describe our method in section 2. We give its two main
properties in section 3. In sections 4 and 5 we describe a complex grid-world task
to compare our method with Q-learning [10]. We show that the learning time is
drastically reduced for a very simple initial knowledge which is not sufficient, by
itself, to solve the task and so must be updated.

2 Formalism 

In this section we develop our method which we call EBRL for Exception-Based 
Reinforcement Learning. To make it easier for the reader we explain it with the 
help of the artificial problem presented in Figure 1. In this grid-world, the agent 
has to reach the cross; he can move in eight directions (north, north-east, . . . ) 
and some walls can be put in the grid (the agent is blocked by them). 




Fig. 1. The agent (triangle) has to reach the cross to solve the task. 



2.1 Procedure, Rule and Exception 

We define in this sub-section the syntax in which our temporally extended actions
will be written. The semantics associated with this syntax is also explained.
We represent a temporally extended action by a procedure:

Procedure_name(state, list_of_parameters) →
termination : termination condition
rule : set of actions S_1

{exception : ( exception condition, set of actions S2)}*^^^ 
next : continuation 

The semantics associated with this syntax is:

— state: the state of the underlying Markov Decision Process (MDP); 

— list_of_parameters: optional parameters, each parameter has a finite number
of different possible values;

— termination: this is the condition of termination of the procedure. This 
condition only depends on the state and the parameters; 
— rule: this rule produces a finite set of primitive or temporally extended
actions (procedures). This rule is applied only if the exception condition is 
not fulfilled; 

— exception: 

• exception condition: if this condition is fulfilled, we do not use the set of 
actions of the rule part, but instead, the set of actions produced by the 
exception part; 

• set of actions: a finite set of primitive or temporally extended
actions (procedures). This rule is applied only if the exception condition
is fulfilled. We have $S_1 \subseteq S_2$;

— next: continuation after the execution of an action of the rule or exception 
part. This part is a call to another procedure. If this procedure has pa- 
rameters, they only depend on the state and the parameters of the current 
procedure. 

When entering a procedure, we first test the termination condition; if it is 
not fulfilled, we test the exception condition; if this condition is true, we choose 
one of the exception actions and after its execution, we continue in the next 
part. If the exception condition is not fulfilled, we choose one of the actions of
the rule part and, after executing it, we continue in the next part. In the
remainder of the article, we call a finite set of procedures a program, and we call
the first procedure of the program to be executed main.
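
The order of the tests on entering a procedure can be summarized in a short sketch. The class and function names below are hypothetical, the parameter list is omitted for brevity, and the choice among the returned actions (which is what the agent learns) is left out on purpose.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Optional, Set

@dataclass
class Procedure:
    """A procedure in the above syntax; all fields are illustrative."""
    name: str
    terminated: Callable[[Hashable], bool]      # termination condition
    rule: Callable[[Hashable], Set[str]]        # action set S_1
    exception_condition: Optional[Callable[[Hashable], bool]] = None
    exception_actions: Optional[Callable[[Hashable], Set[str]]] = None  # S_2

def available_actions(proc, state):
    """Return None if the procedure terminates in `state`, otherwise the
    set of actions the agent may choose from."""
    if proc.terminated(state):
        return None
    has_exception = proc.exception_condition is not None
    if has_exception and proc.exception_condition(state):
        return proc.exception_actions(state)
    return proc.rule(state)
```

After the chosen action has been executed, control passes to the procedure named in the next part; that step is not modeled here.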

2.2 Example 

We illustrate, in this section, the syntax described above, to solve the artificial 
problem. 

main(grid configuration) →
termination : the agent is on the cross
rule : { the set of actions which make the agent get closer to the cross }
exception : (all the actions of the rule part lead to a wall, { all the primitive actions })
next : main()

The a priori knowledge is just to choose in each state of the underlying MDP 
the set of actions (amongst the eight possible ones) which gets the agent closer to 
the cross without taking the walls into account (there are between one and three
such actions). The exception to this rule is when all those actions lead the agent
to a wall; when this is the case, we relax the constraints and allow all actions. 
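
For concreteness, the action set of this main procedure could be computed as in the sketch below. Two points are assumptions, since the text does not fix them: "closer" is measured with the Chebyshev distance (which yields between one and three closer moves on an eight-neighbor grid), and walls are given as a set of blocked cells. All names are hypothetical.

```python
# The eight primitive moves as (dx, dy) offsets.
MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def distance(a, b):
    """Assumption: Chebyshev distance between two grid cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def rule_actions(agent, cross):
    """Moves that bring the agent strictly closer to the cross, ignoring walls."""
    here = distance(agent, cross)
    return [m for m in MOVES
            if distance((agent[0] + m[0], agent[1] + m[1]), cross) < here]

def main_actions(agent, cross, walls):
    """Action set offered by main: empty on termination, all primitive moves
    if every rule action leads to a wall, otherwise the rule actions."""
    if agent == cross:
        return []
    rule = rule_actions(agent, cross)
    blocked = all((agent[0] + dx, agent[1] + dy) in walls for dx, dy in rule)
    return list(MOVES) if blocked else rule
```

Note that the constraints are relaxed only when all rule actions are blocked; if a single one is still free, the agent remains restricted to the rule set.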

2.3 Full State Representation 

We use a program in interaction with an MDP. This program will help the learn- 
ing agent to solve the problem represented by this MDP. We have seen that each 
procedure has the state of the underlying MDP as a parameter. We define the 
full state representation of a procedure as a 4-tuple (procedure_name, state,
list_of_parameters, next_list) where procedure_name is the name of the
procedure, state is the state of the underlying MDP when we enter this procedure,
list_of_parameters is the list of parameters of the procedure and
next_list is the list of procedures to be executed after the execution of
procedure_name.
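
As a data structure, the full state representation is simply a record with four fields. The sketch below is illustrative only: the helper `call` is hypothetical and simplifies continuations to procedure names, and it shows how next_list behaves like a stack of continuations when one procedure invokes another.

```python
from typing import Hashable, NamedTuple, Tuple

class FullState(NamedTuple):
    procedure_name: str
    state: Hashable                 # state of the underlying MDP
    list_of_parameters: Tuple
    next_list: Tuple[str, ...]      # procedures still to be executed

def call(current, callee, params, continuation):
    """Full state on entering `callee` from `current`: the MDP state is
    unchanged and the continuation is pushed onto next_list."""
    return FullState(callee, current.state, tuple(params),
                     (continuation,) + current.next_list)
```

For instance, starting from `FullState("main", (0, 0), (), ())`, a call to a hypothetical procedure `go` with parameter 3 and continuation `main` yields `FullState("go", (0, 0), (3,), ("main",))`.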



2.4 Induced Semi-Markov Decision Process 



In this section we describe an algorithm called construct-SMDP which constructs
a Semi-Markov Decision Process (SMDP) (see [11] and [5]) from a program $\mathcal{P}$
and an underlying MDP $\mathcal{M}$ (a similar construction can be found in [4] and [5]).
Note that this algorithm serves to demonstrate that the execution of a program
on an MDP is an SMDP. We will never have to construct this SMDP explicitly
when executing a program on an MDP. In the remainder of this article we use the following notation:



— $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$ the underlying MDP where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is
a set of actions, $\mathcal{T}$ is a Markovian transition model mapping $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$
into probabilities in $[0, 1]$, $\mathcal{R}$ is a reward function mapping $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$ into
real-valued rewards;

— $\mathcal{P}$ the program;

— Main-parameters the set of all possible lists of values for the parameter list
of the main procedure;

— $\mathcal{A}'$ the set of all the actions of the rule and exception part of all the procedures
of the program and, for the temporally extended actions, all possible
instantiations of their parameters;

— $\mathcal{A}'(s')$, where $s' = (p, s, l, n)$ is a full state representation, is the set of all
actions of the rule and exception part of the procedure $p$. If the terminal
condition of $p$ is fulfilled, $\mathcal{A}'(s')$ = termination;

— $\mathcal{A}'_r(s')$, where $s' = (p, s, l, n)$, is the set of all actions of the rule part of the
procedure $p$;

— $\mathcal{A}'_e(s')$, where $s' = (p, s, l, n)$, is the set of all actions of the exception part of
the procedure $p$; if there is no exception part, $\mathcal{A}'_e(s') = \mathcal{A}'_r(s')$. We recall
that $\mathcal{A}'_r(s') \subseteq \mathcal{A}'_e(s')$;

— next($p$, $s$, $l$), where $p$ is a procedure, $s$ a state of $\mathcal{M}$ and $l$ a list of parameters,
is the procedure of the next part of $p$, with its instantiated parameters.

— add($e$, $l$) returns the list with first element $e$ and tail $l$. head($l$) returns the
first element of $l$ and tail($l$) returns the list $l$ without its first element.

A state of the constructed SMDP is a full state representation of a procedure. 
The SMDP $(\mathcal{S}', \mathcal{A}', \mathcal{T}', \mathcal{R}', \beta)$ (where 
$\mathcal{S}'$, $\mathcal{A}'$, $\mathcal{T}'$, $\mathcal{R}'$ have the same 
meaning as $\mathcal{S}$, $\mathcal{A}$, $\mathcal{T}$ and $\mathcal{R}$ 
respectively, and $\beta$ is a mapping from 
$\mathcal{S}' \times \mathcal{A}' \times \mathcal{S}'$ into real values, see [5]) 
is constructed as follows: 

algorithm construct-SMDP(MDP : M, program : P, discount factor : γ ∈ ]0, 1[) 
begin 
    S' ← {main} × S × Main-parameters × {[]} 
    repeat 
        forall t = (p, s, l, n) ∈ S', forall a ∈ A'(t) do 
            if a = p'(l') then 
                t' ← (p', s, l', add(next(p, s, l), n)) 
                S' ← S' ∪ {t'} 
                T'(t, a, t') ← 1 
                β(t, a, t') ← 1 
            elseif a = termination and p ≠ main then 
                t' ← next-state(s, n) 
                S' ← S' ∪ {t'} 
                T'(t, a, t') ← 1 
                β(t, a, t') ← 1 
            elseif a ∈ A then 
                forall s' ∈ S do 
                    n' ← add(next(p, s, l), n) 
                    t' ← next-state(s', n') 
                    S' ← S' ∪ {t'} 
                    T'(t, a, t') ← T(s, a, s') 
                    R'(t, a, t') ← R(s, a, s') 
                    β(t, a, t') ← γ 
                endforall 
            endif 
        endforall 
    until S' is stable (meaning that no new state has been added to S') 
    All unspecified values for T', R' and β are set to 0 
    return (S', A', T', R', β) 
end 



algorithm next-state(s : s ∈ S, n : next-list) 
begin 
    if n ≠ [] then 
        let p(l) = head(n) 
        t ← (p, s, l, tail(n)) 
        return t 
    else 
        in this case, the program is not correct: 
        the program is not terminated and there is 
        no next procedure to call 
    endif 
end 
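
The next-state step and the handling of a temporally extended action can be sketched in Python as below. This is a hedged illustration only: the `FullState` type, the `call` helper and the representation of a procedure call as a (name, parameters) pair are assumptions introduced for the example, and the caller is assumed to supply the result of next(p, s, l).

```python
from typing import Any, NamedTuple, Tuple

# Assumed encoding of a procedure call: (name, parameters), e.g. ("go_to_ball", ()).
Call = Tuple[str, Tuple[Any, ...]]


class FullState(NamedTuple):
    """Full state representation (procedure_name, state, parameters, next_list)."""
    procedure: str
    state: Any                      # state of the underlying MDP
    params: Tuple[Any, ...]
    next_list: Tuple[Call, ...]     # procedures still to be executed


def next_state(s: Any, n: Tuple[Call, ...]) -> FullState:
    """Pop the head of the next-list and enter that procedure in MDP state s."""
    if not n:
        # Mirrors the 'else' branch of the pseudocode: nothing is left to call.
        raise ValueError("incorrect program: not terminated and next-list is empty")
    (p, l), tail = n[0], n[1:]
    return FullState(p, s, l, tail)


def call(t: FullState, callee: Call, nxt: Call) -> FullState:
    """Temporally extended action p'(l'): push next(p, s, l) onto the next-list.

    `nxt` stands for next(p, s, l) and is supplied by the caller in this sketch.
    """
    p_prime, l_prime = callee
    return FullState(p_prime, t.state, l_prime, (nxt,) + t.next_list)
```

Because a `FullState` is a hashable tuple, it can be used directly as a key of a Q-table or of any other per-state table.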



We only consider programs $\mathcal{P}$ and MDPs $\mathcal{M}$ for which the 
construct-SMDP algorithm terminates. The process built by construct-SMDP 
satisfies the definition of an SMDP. The Markov property is preserved because of 
the full state representation. This SMDP is what the agent faces when executing 
$\mathcal{P}$ in $\mathcal{M}$. Note that we do not discount (the discount is 1) 
when calling a temporally extended action (procedure), and the immediate reward 
is zero. Moreover, we discount by $\gamma$ and receive the immediate reward of the 
underlying MDP when executing a primitive action. Therefore, the solution of the 
constructed SMDP defines an optimal policy that maximizes the expected 
discounted sum of rewards (with discount factor $\gamma$) received by the agent 
executing $\mathcal{P}$ in $\mathcal{M}$. Note that the optimal policy in the SMDP 
can be sub-optimal in the underlying MDP. 

Proposition 1 The construct-SMDP algorithm terminates if the MDP $\mathcal{M}$ is 
finite and the next_list of every possible full state representation is finite. 

Proof: By definition of a procedure, $\mathcal{A}'$ is finite. If $\mathcal{M}$ is 
finite and the next_list is finite for every full state representation, there is 
only a finite number of possible full state representations, so after a finite 
number of iterations of the repeat loop no new state is added to $\mathcal{S}'$. □ 

Definition 1 A procedure p is said to create a next-list cycle if and only if p 
can be reached by a temporally extended action of its rule or exception part. 

Proposition 2 A sufficient condition to ensure a finite next_list is that no 
procedure creates a next-list cycle. 

Proof: The next_list grows only when we choose a temporally extended action in a 
procedure p. If no procedure creates a next-list cycle, we cannot call, in the 
next part of p, a procedure already in next_list (otherwise a temporally extended 
action of p would have reached p). Because the program is composed of a finite 
number of procedures, the next_list cannot grow indefinitely. □ 

Fig. 2. A wall is located between the agent and the cross. (a) grid 
configuration; (b) exception states. 




Fig. 3. Without a propagation mechanism, the agent cannot escape from the dead 
end. (a) grid configuration; (b) exception states. 



2.5 Exception State and Propagation 

We call direct exception state a state for which the exception condition is 
fulfilled. For example, for the artificial problem and the program described in 
Section 2.2, the state (main, (3,3), [], []) (we represent the grid configuration 
only by the agent's position, see Figure 2 (a)) is a direct exception state 
because the rule part prescribes moving into a wall, so the exception condition 
is fulfilled. We associate with a program and an underlying MDP a table 
$\mathcal{E}$ from full state representations to booleans. Let 
$s' = (p, s, l, n) \in \mathcal{S}'$. If $\mathcal{E}(s') = \text{false}$ then, in 
$s'$, we only use the actions of the rule part of procedure p. If 
$\mathcal{E}(s') = \text{true}$ then, in $s'$, we only use the actions of the 
exception part of procedure p. We call a state s for which 
$\mathcal{E}(s) = \text{true}$ an exception state. Initially all entries of this 
table are set to false. In the above example, when the agent, executing the 
program in the underlying MDP, encounters the state (main, (3,3), [], []), we 
set $\mathcal{E}$((main, (3,3), [], [])) to true and from now on, in this state, 
the agent relies only on the actions of the exception part. 

Figure 2 (b) shows, as shaded cells, the exception states after a few iterations 
of the program (a shaded cell with coordinates (x, y) has to be interpreted as 
the full state representation (main, (x, y), [], [])). In this case, this is 
sufficient to solve the task. 

But this mechanism is very limited. For example, in Figure 3 (a), the agent 
cannot escape from the dead end because only the states (main, (6,4), [], []), 
(main, (6,5), [], []) and (main, (6,6), [], []) of the induced SMDP will become 
exception states. We need a way to propagate this information back to the 
predecessor states. This way of propagating exception states is called the 
propagation mechanism. 

Definition 2 We say that $s_1 \to s_2 \to \cdots \to s_n$ is a rule part action 
path from $s_1$ to $s_n$ in $\mathcal{M}'$ iff for all $s_i$ where $1 \le i < n$, 
there exists an action $a \in \mathcal{A}'_r(s_i)$ for which 
$\mathcal{T}'(s_i, a, s_{i+1}) > 0$. 



Definition 3 We denote by $s \xrightarrow{a} s'$ that the action a has led to 
state s' starting from state s. 

We now define the property that a propagation mechanism must fulfill. 

Definition 4 Let s be a state of the induced SMDP where 
$\mathcal{E}(s) = \text{false}$. If for each action a of $\mathcal{A}'_r(s)$ there 
exists a rule part action path $s \xrightarrow{a} \cdots \to s_n$ where 
$\mathcal{E}(s_n) = \text{true}$, then the propagation mechanism ensures that 
after a finite number of time steps, $\mathcal{E}(s)$ will become true. 

For example, with any propagation mechanism, the program of Section 2.2 will 
now solve the dead-end example (see the Properties section). Figure 3 (b) gives 
an example of the exception states obtained with a propagation mechanism for the 
dead-end example after a few iterations of the program. 

Propagation mechanisms can be designed by the user of our method, but we give 
here a very simple and general propagation mechanism that we will use in the 
Results section. We call it the basic propagation mechanism: 

Definition 5 Let $s_1$ and $s_2$ be two states of $\mathcal{S}'$ where 
$\mathcal{E}(s_1) = \text{false}$ and $\mathcal{E}(s_2) = \text{true}$. If the 
agent makes a transition from $s_1$ to $s_2$, then $\mathcal{E}(s_1)$ is set to 
true. 

The basic propagation mechanism fulfills the propagation mechanism property 
(note that it propagates more than Definition 4 requires). The basic propagation 
mechanism is also cheap to compute: when executing an action a in s and ending 
up in state s', for example, we just have to look at the value of 
$\mathcal{E}(s')$ to know whether we have to propagate. The size of the table 
$\mathcal{E}$ is always less than the size of the Q-table, and each entry is just 
a boolean. 
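
A minimal sketch of the exception table with the basic propagation mechanism is given below. The class and method names are hypothetical, and states are assumed to be hashable full state representations; the sketch only illustrates Definition 5 and is not the authors' code.

```python
from typing import Dict, Hashable


class ExceptionTable:
    """Table E: full state representation -> bool (absent entries mean false)."""

    def __init__(self) -> None:
        self._flags: Dict[Hashable, bool] = {}

    def is_exception(self, state: Hashable) -> bool:
        return self._flags.get(state, False)

    def mark_direct(self, state: Hashable) -> None:
        """Record a direct exception state (its exception condition is fulfilled)."""
        self._flags[state] = True

    def propagate(self, s: Hashable, s_next: Hashable) -> None:
        """Basic propagation: after a transition s -> s_next, copy a true flag back."""
        if self.is_exception(s_next):
            self._flags[s] = True
```

Only states whose flag is true are stored, so the table never holds more entries than the number of states actually visited.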

2.6 Learning 

For a learning agent interacting with a Semi-Markov Decision Process, there 
exists a learning algorithm, called SMDP Q-learning, which updates a 
state-action value function Q (which maps state-action pairs into real values, 
where actions can be primitive or temporally extended) at every time period 
with the formula: 

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \beta_t \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$ 

where $a_t$ is the action (primitive or temporally extended) taken by the agent 
in $s_t$ (a state of the SMDP), $s_{t+1}$ is the new state after executing $a_t$ 
in $s_t$, $r_t$ is the reward and $\beta_t$ the discount factor received by the 
agent, and $\alpha$ is the learning rate. This learnt Q-function converges to the 
optimal Q-function under technical conditions similar to those for conventional 
Q-learning (see [5]). 
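
The update rule can be written as a short Python function. This is a sketch under assumptions: the Q-table is a plain dictionary keyed by (state, action) pairs with missing entries read as 0, and the function name and signature are illustrative. Following Section 2.4, a call to a procedure would use r = 0 and beta = 1, while a primitive action would use the reward of the underlying MDP and beta equal to the discount factor.

```python
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def smdp_q_update(q: QTable, s: Hashable, a: Hashable, r: float, beta: float,
                  s_next: Hashable, next_actions: Iterable[Hashable],
                  alpha: float = 0.1) -> None:
    """Q(s,a) <- Q(s,a) + alpha * (r + beta * max_a' Q(s',a') - Q(s,a))."""
    # Unknown entries are read as 0; an empty action set also yields 0.
    best_next = max((q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + beta * best_next - old)
```

Passing only the actions allowed in the next state (rule part or exception part, depending on the exception table) restricts the maximum to the choices the agent can actually make there.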
3 Properties 

We give in this section two properties of the EBRL method. We illustrate them 
with our artificial problem. We suppose in this section that every action gets 
executed in every state infinitely often. 

Definition 6 We say that $s_1 \to s_2 \to \cdots \to s_n$ is an exception part 
action path from $s_1$ to $s_n$ in $\mathcal{M}'$ iff for all $s_i, s_{i+1}$ where 
$1 \le i < n$, there exists an action $a \in \mathcal{A}'_e(s_i)$ for which 
$\mathcal{T}'(s_i, a, s_{i+1}) > 0$. 

Definition 7 For a procedure p, we denote by $\mathcal{S}'_p \subseteq \mathcal{S}'$ 
the set of states of the form $(p, s, l, n)$. 

Definition 8 For a procedure p, we denote by $B_p \subseteq \mathcal{S}'_p$ the set 
of states of the form $(p, s, l, n)$ for which there exists an exception part 
action path in $\mathcal{M}'$ leading to a state s' for which the termination 
condition of the procedure p is fulfilled. 

Theorem 1 For a procedure p, if for each $s \in \mathcal{S}'_p$ and each action 
$a \in \mathcal{A}'_r(s)$ there exists a rule part action path 
$s \xrightarrow{a} \cdots \to s'$ leading to a state s' for which either the 
termination condition of the procedure p is fulfilled or s' is a direct 
exception state, then, for each state of $B_p$, the agent executing the procedure 
p can reach a state in $\mathcal{S}'_p$ for which the terminal condition of the 
procedure p is fulfilled. 

Proof: 

a) For all $s \in \mathcal{S}'_p$, by hypothesis, 

• Either there exists, for an action $a \in \mathcal{A}'_r(s)$, a rule part 
action path $s \xrightarrow{a} \cdots \to s'$ for which s' fulfills the 
termination condition of p. As $\mathcal{A}'_r(s) \subseteq \mathcal{A}'_e(s)$, 
using either set of actions ($\mathcal{A}'_r(s)$ or $\mathcal{A}'_e(s)$, depending 
on the value of $\mathcal{E}(s)$) we can reach the terminal condition of p. 

• Or, for all $a \in \mathcal{A}'_r(s)$ there exists a rule part action path 
$s \xrightarrow{a} \cdots \to s'$ for which $\mathcal{E}(s') = \text{true}$ and so, 
by definition of the propagation mechanism, $\mathcal{E}(s)$ will become true 
after a finite number of iterations. 

b) Let $s_1 \in B_p$; then, by definition, there exists an exception part action 
path $s_1 \to s_2 \to \cdots \to s_n$ where $s_n$ is a state for which the 
termination condition of the procedure p is fulfilled. Consider any $s_i$ in this 
path with $1 \le i < n$. If $\mathcal{E}(s_i) = \text{false}$ then, by a), either 
the agent can reach from $s_i$ a state s' for which the terminal condition of the 
procedure p is fulfilled, or $\mathcal{E}(s_i)$ will become true after a finite 
number of time steps. If $\mathcal{E}(s_i) = \text{true}$ then there exists an 
exception part action which leads to $s_{i+1}$. □ 

For example, we can prove with this theorem that the program defined in 
Section 2.2, with a propagation mechanism, will get the agent to the cross if it 
is possible to reach it in the underlying MDP using all the primitive actions. 
This is true even if the actions are stochastic (p% of the time the action is 
executed correctly and (100 − p)% of the time a randomly chosen action is 
executed instead). 

Definition 9 We denote by $\mathcal{E}(\mathcal{M}')$ the SMDP obtained from 
$\mathcal{M}'$ in which, for each state $s \in \mathcal{S}'$, we only use the 
actions of $\mathcal{A}'_e(s)$ if $\mathcal{E}(s) = \text{true}$ and only the 
actions of $\mathcal{A}'_r(s)$ otherwise. 



Proposition 3 After a finite number of time steps, $\mathcal{E}$ does not change 
anymore. We then denote the table by $\mathcal{E}_f$. 

Proof: The table $\mathcal{E}$ has a finite number of entries and, for a state s 
for which $\mathcal{E}(s) = \text{false}$, by the definition of the direct 
exception states and the propagation mechanism, either $\mathcal{E}(s)$ remains 
false or $\mathcal{E}(s)$ becomes true after a finite number of time steps. □ 

Theorem 2 SMDP Q-learning, applied to an agent executing $\mathcal{P}$ in 
$\mathcal{M}$ with exception table $\mathcal{E}$, will converge to the optimal 
policy in $\mathcal{E}_f(\text{construct-SMDP}(\mathcal{M}, \mathcal{P}, \gamma))$ 
w.p.1 if $\sum \alpha = \infty$ and $\sum \alpha^2 < \infty$. 

Proof: $\mathcal{E}_f(\text{construct-SMDP}(\mathcal{M}, \mathcal{P}, \gamma))$ is 
a finite SMDP fulfilling the preconditions of Theorem 2 of Parr [5]. □ 

This theorem tells us that we obtain the best policy in the SMDP obtained by 
the agent executing the program in the underlying MDP. This policy may be 
non-optimal in the underlying MDP. Note that we can increase the search space, 
and possibly the quality of the solution, by relaxing the constraints in the 
states for which $\mathcal{E}(s) = \text{false}$. In doing so, for the problem of 
Section 2.2 and for the problem of the following section, we are guaranteed to 
find the optimal solution in the underlying MDP. 

4 Example 

We will use a task very similar to the Sokoban game to illustrate our method 
because of its complexity. 

We put an agent (which can move in 8 directions: north, north-east, east, ...) in 
a grid. A ball, a goal and walls are placed in the grid. The aim of the agent is 
to push the ball into the goal (see Figure 5 (a), where the agent, the ball, the 
goal and the walls are represented by a triangle, a filled circle, a cross and 
filled cells respectively). We assume the agent knows the ball and goal 
locations. As the agent can only push the ball and not pull it, there are many 
situations in which the ball can become stuck or can be moved only within a 
limited set of cells. The actions are stochastic: 90% of the time the action is 
executed correctly and 10% of the time another randomly chosen action is 
executed. 

4.1 Task Program 

We write in our formalism a program to help the agent solve a given grid 
configuration; this program is described in Figure 4. 

main(state) → 
    termination : the ball is in the goal location 
    rule        : go_to_ball() 
    next        : go_to_goal() 

go_to_ball(state) → 
    termination : the agent is next to the ball 
    rule        : {go to ball actions} 
    exception   : (all the actions in {go to ball actions} prescribe 
                  to go into a wall, {all the primitive actions}) 
    next        : go_to_ball() 

go_to_goal(state) → 
    termination : the ball is in the goal location 
    rule        : if the agent is next to the ball then 
                  {go to goal actions} else go_to_ball() 
    exception   : (all the actions in {go to goal actions} prescribe 
                  to push the ball in a wall or the cell where the 
                  agent has to go to push the ball in this 
                  direction is a wall, 
                  {push_north([ball_location]), 
                   push_north_east([ball_location]), ...}) 
    next        : go_to_goal() 

push_north(state, [last_ball_location]) → 
    termination : ball has been pushed 
    rule        : let (x, y) be the cell to go to in order to push to 
                  the north. 
                  if can go to (x, y) then 
                  go_to([last_ball_location;x;y]) else 
                  learn_to_go_to([x;y]) 
    next        : push([last_ball_location]) 

go_to(state, [last_ball_location;x;y]) → 
    termination : ball has been pushed or agent on (x, y) 
    rule        : {go to actions} 
    next        : go_to([last_ball_location;x;y]) 

learn_to_go_to(state, [last_ball_location;x;y]) → 
    termination : ball has been pushed or agent on (x, y) 
    rule        : {learn to go to actions} 
    exception   : (all the actions in {learn to go to actions} 
                  prescribe to go into a wall, 
                  {all the primitive actions}) 
    next        : learn_to_go_to([last_ball_location;x;y]) 

Fig. 4. Program to help the agent to solve the task; see the text for details 
about the rule parts. 

The program is broken down into two sub-tasks: first, go to the ball and 
second, push the ball to the goal location. The {go to ball actions} set of the 
go_to_ball procedure is the set of actions which bring the agent closer to the 
ball without taking the walls into account (there are between one and three such 
actions for a given state). The {go to goal actions} set of the go_to_goal 
procedure is the set of procedures amongst push_north, push_north_east, ... which 
would bring the ball closer to the goal if there were no walls (there are between 
one and three such procedures). Note that in the rule part of the go_to_goal 
procedure we test whether the agent is next to the ball; this is because the 
actions may be stochastic. We only write the code for push_north because 
push_north_east, ..., are similar. In push_north, we test whether we can go to the 
(x, y) cell: we look at the 8 cells around the ball and test whether there is a 
path using only those 8 cells. The {go to actions} set contains the action to 
take to go to (x, y) using those 8 cells when such a path exists. The {learn to go 
to actions} set of the learn_to_go_to procedure is the set of actions which bring 
the agent closer to the ball without taking the walls into account. The push 
procedure terminates if the agent is not next to the ball; otherwise it pushes 
the ball. Note that the program explicitly provides only very high-level 
knowledge: it does not know how to avoid obstacles. 



5 Results 

In this section, we test our method on the grid configuration presented in 
Figure 5 (a), which is quite difficult for the a priori knowledge given by the 
program. If we use the program without a propagation mechanism, this task 
cannot be solved. An epoch consists of 800 primitive actions. A reinforcement of 
−0.01 is given on each transition in the grid, except when the ball is pushed 
into the goal location, where a reinforcement of 10 is given. We use an ε-greedy 
policy with parameter 0.9, the discount factor γ is 0.999 and the learning rate 
α is 0.1. We plot two curves (Figure 5 (b)), one for the Q-learning algorithm and 
one for our method. Every 10 epochs we set the policy to the greedy one and plot 
the results, which are averaged over ten runs and smoothed. On average, the 
greedy policy first solves the task after 458 epochs with our method and after 
248621 epochs with Q-learning. The number of states memorized at the end of this 
experiment is 138214 for our method and 248948 for Q-learning. Note that the 
basic propagation is quite expensive in the number of memorized states, but it 
explores a larger state space and so can find a better solution. Using various 
grid configurations, we noted that the learning time with our method depends 
more on the difficulty of the grid relative to the initial knowledge than on the 
state space size. Moreover, we also came to the conclusion that the larger the 
state space, the better our method performs compared to Q-learning. 

Fig. 5. Comparison between Q-learning and our method. (a) 25x25 grid; 
(b) results (horizontal axis: number of epochs). 



6 Conclusion 

We have presented in this paper a method to introduce knowledge into 
reinforcement learning in order to speed up learning using temporally extended 
actions. Our work can be related to the Parti-game algorithm of Moore [12], 
where a greedy controller helps the agent to reach a goal state. To avoid 
becoming trapped, the Parti-game algorithm divides the state space more and more 
finely. Our method does not divide the state space: the resolution of the state 
space is given, but the number of choices to be made in each state is variable. 
With variable resolution we can potentially store fewer states, but note that if 
the a priori knowledge allows only one action in a given state, this state does 
not have to be stored (no choice has to be made in this state). In our method we 
can express that more than one action seems good a priori, and so test different 
potentially good paths to the goal before increasing the search space. Moreover, 
we do not assume that the dynamics of the environment are deterministic, we can 
learn arbitrary reward functions and we do not need to learn a greedy 
controller. One of the main drawbacks of constraining the number of available 
actions is that it can be difficult to guarantee that the agent can still solve 
the task with this a priori knowledge. In this paper, we formulate and prove a 
theorem which can be used to guarantee that a given task can still be solved 
with our method. We tested our method on a complex grid-world task and showed 
that the learning time is drastically reduced compared to Q-learning. We are 
currently testing our method in a continuous state space. 



Abstract. In this paper we study the influence of noise in probabilistic 
grammatical inference. We paradoxically bring out the idea that special- 
ized automata deal better with noisy data than more general ones. We 
propose then to replace the statistical test of the Alergia algorithm by 
a more restrictive merging rule based on a test of proportion comparison. 
We experimentally show that this way to proceed allows us to produce 
larger automata that better treat noisy data, according to two different 
performance criteria (perplexity and distance to the target model). 

Keywords: probabilistic grammatical inference, noisy data, statistical 
approaches 



1 Introduction 

Nowadays the quantity of data stored in databases becomes more and more im- 
portant. Beyond the fact that the amount of information is hard (in terms of 
complexity) to process by machine learning algorithms, these data often contain 
a high level of noise. To deal with this problem, many data reduction techniques 
aim at either removing irrelevant instances (prototype selection [1]) or deleting 
irrelevant features (feature selection [2]). These techniques always need positive 
examples and negative examples of the concept to learn. An outlier is seen as a 
positive (resp. negative) instance which should be negatively (resp. positively) 
labeled in absence of noise. However, in some real applications, it is difficult, even 
impossible, to have negative examples, that is for example the case in natural 
language processing. In such a context, learning algorithms exploit statistical 
information to infer a model that defines a probability distribution on positive 
data. Because of the absence of negative examples, standard data reduction 
techniques are not suited to removing outliers, which in fact require specific 
processes. In the context of probabilistic models, an outlier can be seen as a 
weakly relevant instance, i.e. weakly probable because of noise. While such 
models are a priori known to be more efficient for dealing with noisy data, no 
study, as far as we know, has been devoted to analyzing the impact of noise in 
the specific field of probabilistic grammatical inference. 



N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 169-180, 2003. 
(c) Springer-Verlag Berlin Heidelberg 2003 

Grammatical inference [3] is a subtopic of machine learning which aims at 
learning models from a set of sequences (or trees). Probabilistic grammatical 
inference learns probabilistic automata defining a distribution on the language 
recognized by the automaton. In this framework, the data (always considered as 
positive) are supposed to be generated from a probability distribution, and the 
objective is to learn the automaton which generated the data. A successful 
learning task produces a probabilistic automaton which gives a good estimate of 
the initial distribution. 

In this paper we are interested in probabilistic grammatical inference 
algorithms based on state merging techniques. In particular, we study the 
behavior of the Alergia algorithm [4, 5] in the context of noisy data. Our 
concern is the generalization process: we think that a generalization resulting 
from the merging of noisy and correct data in the same state is particularly 
irrelevant. This can be dramatic, especially in cyclic automata, because this 
kind of generalization could increase the deviation from the initial 
distribution. We therefore need to restrict the state merging rule to avoid such 
situations. In Alergia, the generalization process consists in merging states 
that are considered statistically close according to a test based on the 
Hoeffding bound [6]. However, this bound is an asymptotic one and is thus only 
relevant for large samples. To deal with small sets, [7] proposed a more general 
approach (called MAlergia) using multinomial statistical tests in the merging 
decision. Despite its good performance with small datasets, MAlergia has a major 
disadvantage: a high complexity on very small datasets, for which the 
calculation of a costly statistic is needed. In this paper we overcome the 
drawbacks of both Alergia and MAlergia. We replace the original test of Alergia 
by a more restrictive one based on a test of proportion comparison. This test 
can deal with both large and small datasets, and we show experimentally that it 
performs better in the context of noisy data. 

After a brief reminder about probabilistic finite state automata and their 
learning algorithms, we describe in Section 2 the state merging rule of the 
algorithm Alergia and its extension with a multinomial approach in MAlergia. In 
Section 3, we propose a new approach based on a test of proportion comparison. 
We theoretically prove that the bound of our test is always smaller than 
Hoeffding's, expressing the fact that a merge will always be more difficult to 
accept in the presence of noise. We also relate our work to the multinomial 
approach. Section 4 deals with experiments comparing the three approaches with 
different levels of noise. 

2 Learning of Probabilistic Finite State Automata 

Probabilistic Finite State Automata (PFSA) are a probabilistic extension of 
finite state automata and define a probability distribution on the strings 
recognized by the automata. 

2.1 Definitions and Notations 

Definition 1 A PFSA $A$ is a 6-tuple $(Q, \Sigma, \delta, p, q_0, F)$. $Q$ is a 
finite set of states. $\Sigma$ is the alphabet. $\delta : Q \times \Sigma \to Q$ 
is the transition function. $p : Q \times \Sigma \to [0, 1]$ is the probability 
of a transition. $q_0$ is the initial state. $F : Q \to [0, 1]$ is the probability 
for a state to be a final state. 

In this article, we only consider deterministic PFSA (called PDFA), i.e. where 
$\delta$ is injective. This means that given a state q and a symbol s, the state 
reached from the state q by the symbol s is unique if it exists. In order to 
define a probability distribution on $\Sigma^*$ (the set of all strings built on 
$\Sigma$), p and F must satisfy the following consistency constraint: 

$$\forall q \in Q, \quad F(q) + \sum_{a \in \Sigma} p(q, a) = 1.$$ 




Fig. 1. A PDFA with $q_0 = 0$ and its probabilities 



A string $s_0 \cdots s_{n-1}$ is recognized by an automaton $A$ iff there exists 
a sequence of states $e_0 \ldots e_n$ such that: (i) $e_0 = q_0$, 
(ii) $\forall i \in [0, n-1]$, $\delta(e_i, s_i) = e_{i+1}$, (iii) $F(e_n) \neq 0$. 
Then the automaton assigns to the string the following probability: 

$$P_A(s_0 \ldots s_{n-1}) = \left( \prod_{i=0}^{n-1} p(e_i, s_i) \right) \cdot F(e_n)$$ 

For example, the automaton represented in Figure 1 recognizes the string baaa 
with probability 0.75 × 1.0 × 0.2 × 0.2 × 0.6 = 0.018. 
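
The probability computation can be illustrated with a few lines of Python. The automaton below is hypothetical: Figure 1 is not reproduced in the text, so the transitions were chosen only so that the string baaa follows the factors quoted above (0.75, 1.0, 0.2, 0.2, 0.6) and so that the consistency constraint holds in every state.

```python
from typing import Dict, Tuple

# Hypothetical PDFA: (state, symbol) -> (next state, transition probability).
DELTA_P: Dict[Tuple[int, str], Tuple[int, float]] = {
    (0, "b"): (1, 0.75),
    (1, "a"): (2, 1.0),
    (2, "a"): (2, 0.2),
    (2, "b"): (1, 0.2),
}
# Probability for each state to be final; F(q) + sum_a p(q, a) = 1 for every q.
FINAL: Dict[int, float] = {0: 0.25, 1: 0.0, 2: 0.6}


def string_probability(word: str, q0: int = 0) -> float:
    """P_A(s_0 ... s_{n-1}) = (prod_i p(e_i, s_i)) * F(e_n); 0.0 if not recognized."""
    state, prob = q0, 1.0
    for symbol in word:
        if (state, symbol) not in DELTA_P:
            return 0.0              # no transition: the string is not recognized
        state, p = DELTA_P[(state, symbol)]
        prob *= p
    return prob * FINAL[state]
```

Because the transition function is deterministic, a single pass over the string is enough; no sum over alternative paths is needed.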

2.2 Learning Algorithms 

Many algorithms have been proposed to infer PDFA from examples [4, 5, 7-9]. 
Most of them follow the same scheme, based on state merging and summarized in 
Algorithm 1. Given a set of positive examples $S_+$, the algorithm first builds 
the probabilistic prefix tree acceptor (PPTA). The PPTA is an automaton 
accepting all the examples of $S_+$ (see the left part of Figure 2 for an example, 
$\lambda$ corresponding to the empty string). It is constructed such that the 
states corresponding to common prefixes are merged and such that each state and 
each transition is associated with the number of times it is used while parsing 
the learning set. This number is then used to define the function p. If $C(q)$ is 
the number of times a state q is used while parsing $S_+$, and $C(q, a)$ is the 
number of times the transition $(q, a)$ is used while parsing $S_+$, then 
$p(q, a) = \frac{C(q, a)}{C(q)}$. Similarly, if $C_f(q)$ is the number of times q 
is used as final state in $S_+$, we have $F(q) = \frac{C_f(q)}{C(q)}$. 
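
As a hedged illustration of this counting step, the sketch below builds the counts of a PPTA for the sample of Figure 2 and derives p and F from them. Identifying each state with the prefix that reaches it, and the function names `build_ppta` and `probabilities`, are assumptions of the example rather than part of the original algorithms.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Tuple


def build_ppta(sample: Iterable[str]):
    """Count C(q), C(q, a) and C_f(q); each state q is named by its prefix."""
    c: DefaultDict[str, int] = defaultdict(int)                     # C(q)
    c_trans: DefaultDict[Tuple[str, str], int] = defaultdict(int)   # C(q, a)
    c_final: DefaultDict[str, int] = defaultdict(int)               # C_f(q)
    for word in sample:
        prefix = ""                 # the root is the empty string (lambda)
        c[prefix] += 1
        for symbol in word:
            c_trans[(prefix, symbol)] += 1
            prefix += symbol
            c[prefix] += 1
        c_final[prefix] += 1        # the word ends in this state
    return c, c_trans, c_final


def probabilities(c, c_trans, c_final):
    """p(q, a) = C(q, a) / C(q) and F(q) = C_f(q) / C(q)."""
    p = {(q, a): n / c[q] for (q, a), n in c_trans.items()}
    f = {q: c_final.get(q, 0) / n for q, n in c.items()}
    return p, f


counts = build_ppta(["ba", "baa", "baba", ""])
p, f = probabilities(*counts)
```

On this sample the root is used four times, three of them through the transition labeled b and once as a final state, so p(λ, b) = 0.75 and F(λ) = 0.25.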

The second step of the algorithm consists in running through the PPTA 
(function choose_states(A)) and testing whether the considered states are 
statistically compatible (function compatible($q_i$, $q_j$, $\alpha$)). Several 
consecutive merging operations are done in order to keep the automaton 
structurally deterministic. The algorithm stops when no more merging is 
possible. For example, the right part of Figure 2 represents the merging of the 
states labeled b and bab, and the merging of the three states labeled ba, baa, 
baba from the PPTA on the left part of the figure. 

Data: S+ training examples (strings) 
Result: A, a PDFA 
begin 
    A ← build_PPTA(S+); 
    while (q_i, q_j) ← choose_states(A) do 
        if compatible(q_i, q_j, α) then merge(A, q_i, q_j); 
    end 
    return A; 
end 

Algorithm 1. Generic algorithm for inferring PDFA 

Fig. 2. PPTA of S+ = {ba, baa, baba, λ} on the left. On the right, the PDFA 
obtained after two state mergings 



2.3 Compatibility in the Algorithm Alergia 

In Alergia [5], the compatibility of two states depends on: (i) the compatibility 
of their outgoing probabilities on the same letter, (ii) the compatibility of their 
probabilities to be final and (iii) the recursive compatibility of their successors. 



Definition 2 Two states $q_1$, $q_2$ are compatible iff: (i) $\forall a \in \Sigma$, 
$\left| \frac{C(q_1, a)}{C(q_1)} - \frac{C(q_2, a)}{C(q_2)} \right|$ is not 
significantly higher than 0, (ii) 
$\left| \frac{C_f(q_1)}{C(q_1)} - \frac{C_f(q_2)}{C(q_2)} \right|$ is not 
significantly higher than 0, (iii) the two previous conditions are recursively 
satisfied for all the states reachable from $(q_1, q_2)$. 



The notion of significance is statistically assessed in Alergia. It consists in 
comparing the deviation between two proportions: 
$\left| \frac{x_1}{n_1} - \frac{x_2}{n_2} \right|$, where $n_1 = C(q_1)$, 
$n_2 = C(q_2)$, $x_1$ equals either $C(q_1, a)$ or $C_f(q_1)$ and $x_2$ either 
$C(q_2, a)$ or $C_f(q_2)$ ($a \in \Sigma$). 

The test of compatibility is derived from the Hoeffding bound [6]. This bound 
is used to define a probability on the estimation error of a Bernoulli variable p 
estimated by the quantity $\frac{x}{n}$, which is a frequency observed over n 
trials: 

$$P\left( \left| p - \frac{x}{n} \right| < \sqrt{\frac{1}{2n} \ln \frac{2}{\alpha}} \right) > 1 - \alpha$$ 

Since Alergia takes into account two frequencies ($\frac{x_1}{n_1}$ and 
$\frac{x_2}{n_2}$), it must add two estimation errors, which assesses, in a way, 
the worst possible case. 

Definition 3 Two proportions are compatible in Alergia iff: 

$$\left| \frac{x_1}{n_1} - \frac{x_2}{n_2} \right| < \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \qquad (1)$$ 
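
Inequality (1) translates directly into code. The sketch below is illustrative only: the function names are hypothetical and the test is applied to a single pair of proportions, without the recursion over successor states required by Definition 2.

```python
import math


def hoeffding_bound(n1: int, n2: int, alpha: float) -> float:
    """Right-hand side of inequality (1): the sum of two estimation errors."""
    return math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))


def alergia_compatible(x1: int, n1: int, x2: int, n2: int, alpha: float) -> bool:
    """True when |x1/n1 - x2/n2| stays below the Hoeffding-based bound."""
    return abs(x1 / n1 - x2 / n2) < hoeffding_bound(n1, n2, alpha)
```

The bound depends only on the sample sizes and on α, not on the observed proportions themselves.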



Although this upper bound is statistically correct, we can note that by adding 
two estimation errors, the test tends to accept state mergings often. 
Consequently, the probability of wrongly accepting a merging (type II risk 
$\beta$) is under-estimated, which can have dramatic effects on the final 
automata, particularly in the presence of noise. Moreover, the asymptotic bound 
introduced in inequality (1) is only relevant for large samples. In order to 
overcome this drawback, Kermorvant and Dupont proposed MAlergia [7] for dealing 
with small datasets. 



2.4 Compatibility in the Algorithm MAlergia 

In MAlergia, each state of the automaton is associated with a multinomial 
distribution modeling the outgoing transition probabilities and the final 
probability. In other words, each state is associated with a multinomial random 
variable with parameters $\tau = \{\tau_1, \ldots, \tau_K\}$, each $\tau_i$ 
corresponding to the transition probability of the i-th letter of the alphabet, 
which includes a special final state symbol. In the PPTA each state is seen as a 
realization of the multinomial random variable $\tau$ (see [7] for more details). 
Two states are merged if they are both realizations of the same multinomial 
random variable. A statistical test asymptotically following a chi-square 
distribution is used. When the constraints of approximation are not verified 
(i.e. for very small datasets), a Fisher exact test is used. However, in 
MAlergia, this test requires the estimation of the probability of all 
contingency tables of size $2 \times K$ with the same marginal counts, which 
results in a very high complexity of the algorithm. 



3 A New Compatibility Test Based on Proportions 

In this section, we propose a new statistical approach that overcomes the drawbacks
of Alergia and MAlergia and is particularly relevant in the presence of noise.

3.1 Statistical Framework

We use here a test of proportion comparison. It aims at comparing the
proportions $\frac{x_1}{n_1}$ and $\frac{x_2}{n_2}$ (the same as those used in Alergia), estimators of the
probabilities $p_1$ and $p_2$, and testing the hypothesis $H_0 : p_1 = p_2$ versus $H_a : p_1 \neq p_2$.
We compute the statistic:

$$Z = \frac{\frac{x_1}{n_1} - \frac{x_2}{n_2}}{\sqrt{p\,q\,\frac{(n_1 + n_2)}{n_1 n_2}}} \quad \text{where } p = 1 - q = \frac{x_1 + x_2}{n_1 + n_2}$$

$Z$ approximately follows the normal distribution when $H_0$ is true. We reject
$H_0$ in favor of $H_a$ whenever $|Z| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the $(1 - \alpha/2)$-percentile of
the normal distribution.

Then we consider that two proportions are not statistically different if:

$$\left| \frac{x_1}{n_1} - \frac{x_2}{n_2} \right| < z_{\alpha/2} \sqrt{p\,q\,\frac{(n_1 + n_2)}{n_1 n_2}}$$



Note that the constraints of approximation are satisfied when $n_1 + n_2 > 20$,
or when $n_1 + n_2 > 40$ if either $x_1$ or $x_2$ is smaller than 5. When these
conditions are not satisfied, we use a Fisher exact test, without the high
computational cost incurred in MAlergia.
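
The proportion-based rule can be sketched in Python as follows, next to the Hoeffding bound so that the two thresholds can be compared numerically. All function names are hypothetical. The sample-size check is one possible reading of the conditions quoted above, and the Fisher exact test used for small samples is deliberately left out: the sketch raises an error instead. The percentile $z_{\alpha/2}$ is taken from `statistics.NormalDist` in the Python standard library (Python 3.8 or later).

```python
import math
from statistics import NormalDist


def normal_approximation_ok(x1, n1, x2, n2):
    """Sample-size conditions for the Z statistic (assumed reading of the text)."""
    total = n1 + n2
    if min(x1, x2) < 5:
        return total > 40
    return total > 20


def proportion_bound(x1, n1, x2, n2, alpha):
    """Right-hand side of the proportion-based merging rule."""
    p = (x1 + x2) / (n1 + n2)  # pooled estimate, with q = 1 - p
    q = 1.0 - p
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * math.sqrt(p * q * (n1 + n2) / (n1 * n2))


def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the Alergia rule, for comparison."""
    return math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )


def proportions_compatible(x1, n1, x2, n2, alpha):
    """True when the two proportions are not statistically different."""
    if not normal_approximation_ok(x1, n1, x2, n2):
        # The paper switches to a Fisher exact test here; not implemented.
        raise ValueError("sample too small for the normal approximation")
    return abs(x1 / n1 - x2 / n2) < proportion_bound(x1, n1, x2, n2, alpha)
```

For 40/100 versus 60/100 and $\alpha = 0.05$, the proportion bound is about 0.139 while the Hoeffding bound is about 0.272, so this merge is rejected by the proportion test although the Hoeffding-based test would accept it.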



3.2 Theoretical Comparison 

We have seen before that the risk $\beta$ is under-estimated in Alergia. We prove
now that our test results in a more restrictive merging rule.



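Theorem 1 $\forall \alpha < 0.734$, $\forall\, 0 < \alpha' < 1$:

$$z_{\alpha/2} \sqrt{p\,q\,\frac{(n_1 + n_2)}{n_1 n_2}} < \sqrt{\frac{1}{2} \ln\frac{2}{\alpha'}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right)$$

Proof. First we denote: $A = z_{\alpha/2} \sqrt{p\,q\,\frac{(n_1 + n_2)}{n_1 n_2}}$ and $B = \sqrt{\frac{1}{2} \ln\frac{2}{\alpha'}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right)$.

Since $p < 1$ and $q < 1$ and so $\sqrt{p\,q} < 1$, then we can deduce that:

$$A < z_{\alpha/2} \sqrt{\frac{(n_1 + n_2)}{n_1 n_2}} = z_{\alpha/2} \sqrt{\frac{1}{n_2} + \frac{1}{n_1}}$$

Then if we choose $\alpha < 0.734$, then the $(1 - \frac{\alpha}{2})$-percentile of the standard
normal distribution is lower than 0.34, thus: $z_{\alpha/2} < 0.34 < \frac{1}{2}\ln(2) < \sqrt{\frac{1}{2}\ln(2)}$.
Moreover, for all $0 < \alpha' < 1$, $\ln(2) < \ln(\frac{2}{\alpha'})$, then

$$A < \sqrt{\frac{1}{2} \ln\frac{2}{\alpha'}} \sqrt{\frac{1}{n_2} + \frac{1}{n_1}}$$

and then since $\frac{1}{n_2} + \frac{1}{n_1} < \frac{1}{n_2} + \frac{1}{n_1} + \frac{2}{\sqrt{n_1}\sqrt{n_2}}$ for all $n_1 > 0$ and $n_2 > 0$, then
$\sqrt{\frac{1}{n_2} + \frac{1}{n_1}} < \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}$ and we conclude that:

$$A < \sqrt{\frac{1}{2} \ln\frac{2}{\alpha'}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) = B \qquad \square$$

Fig. 3. Effect of a bad merging from a PPTA built with 4 strings ac and 6 strings bb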



Assuming that we rarely build statistical tests with $\alpha$ higher than 0.5, the
condition $\alpha < 0.734$ is not too constraining. The first direct consequence of this
theorem is that our new merging rule is more restrictive, limiting the impact 
of potential noisy data. Secondly, such a rule tends to infer larger automata, 
both in the number of states and in the number of transitions. This situation 
can seem paradoxical. Actually, according to the theory of the learnable, and 
particularly in exact grammatical inference, too large automata tend to overfit 
the data resulting in a decrease of the generalization ability. This is true in exact 
grammatical inference, when we have both positive and negative examples, and 
when the goal consists in building a classifier which can predict, via a final state, 
the label of a new example. In this case, one must relax the merging constraint in 
the presence of noise, to allow a legitimate merging. Then we aim at inferring an 
automaton as small as possible to reduce the complexity of the model and then its 
VC-dimension. The problem seems to be different in probabilistic grammatical 
inference, where the inferred automaton is only able to provide a probability 
distribution. The error imputable to the automaton can come not only from an 
over-estimation but also from an under-estimation of the probability density. 
In this case, what is the consequence of a wrongly accepted merging due, for 
example, to the presence of noise? Figure 3 shows an explicit example. Before 
the merging, the probability of a string ac is 0.4 * 1 * 1 = 0.4 and 0 for a string 
ab. Assume that a “bad” merging (of the states 1 and 2) is accepted, ac becomes 
under-estimated (0.4 * 0.4 * 1 = 0.16) and ab becomes more probable than ac 
(0.4 * 0.6 * 1 = 0.24). This example shows that, particularly in the presence of 
noise but also in noise-free situations, we must reduce the risk $\beta$, resulting in
the rejection of some mergings, and then in the inference of larger automata. 
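
The arithmetic of the example of Figure 3 can be checked with a small Python sketch. It builds a prefix tree with counts from 4 strings ac and 6 strings bb, then merges the two states reached after a and after b. The helper names are hypothetical, the state numbering follows insertion order and so may differ from the figure, and `merge_disjoint` only covers the simple case where the merged states have no outgoing letter in common (a general state merging would also need recursive folding of the successors).

```python
def build_ppta(sample):
    """Build a prefix tree with counts from a list of strings.

    Returns (transitions, final, visits) where transitions[q][a] = [q2, count],
    final[q] counts the strings ending in q and visits[q] those reaching q.
    """
    transitions, final, visits = [{}], [0], [0]
    for string in sample:
        state = 0
        visits[state] += 1
        for letter in string:
            if letter not in transitions[state]:
                transitions.append({})
                final.append(0)
                visits.append(0)
                transitions[state][letter] = [len(transitions) - 1, 0]
            edge = transitions[state][letter]
            edge[1] += 1
            state = edge[0]
            visits[state] += 1
        final[state] += 1
    return transitions, final, visits


def string_probability(automaton, string):
    """Probability of a string: product of transition and final frequencies."""
    transitions, final, visits = automaton
    state, prob = 0, 1.0
    for letter in string:
        edge = transitions[state].get(letter)
        if edge is None:
            return 0.0
        prob *= edge[1] / visits[state]
        state = edge[0]
    return prob * final[state] / visits[state]


def merge_disjoint(automaton, keep, drop):
    """Merge state `drop` into `keep`; handles disjoint outgoing letters only."""
    transitions, final, visits = automaton
    if set(transitions[keep]) & set(transitions[drop]):
        raise ValueError("this sketch does not fold shared successors")
    transitions = [{a: list(e) for a, e in t.items()} for t in transitions]
    final, visits = list(final), list(visits)
    transitions[keep].update(transitions[drop])
    transitions[drop] = {}
    final[keep] += final[drop]
    visits[keep] += visits[drop]
    final[drop] = visits[drop] = 0
    for outgoing in transitions:
        for edge in outgoing.values():
            if edge[0] == drop:
                edge[0] = keep  # redirect incoming edges to the merged state
    return transitions, final, visits


ppta = build_ppta(["ac"] * 4 + ["bb"] * 6)
after_a = ppta[0][0]["a"][0]  # state reached by 'a' (state 1 in Fig. 3)
after_b = ppta[0][0]["b"][0]  # state reached by 'b' (state 2 in Fig. 3)
merged = merge_disjoint(ppta, after_a, after_b)
for s in ("ac", "ab"):
    print(s, round(string_probability(ppta, s), 2),
          round(string_probability(merged, s), 2))
```

Running the sketch reproduces the values quoted above: the probability of ac goes from 0.4 to 0.16, and that of ab from 0 to 0.24.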

Thus, we think that the use of a more specific and restrictive test is more
relevant for dealing with noise. In MAlergia, Kermorvant and Dupont empirically
note that their merging rule is also more restrictive. However, we think that it is
not sufficient. Actually, in the multinomial approach, the frequencies of a noisy
transition can be absorbed by the global aspect of the test. In our
proportion-based test, the merging rule is applied to each transition, allowing us to better
detect differences between the two tested states. Our proportion test also works
with small samples and thus does not share the problem of the asymptotic Hoeffding
bound. For very small samples, a Fisher test is used. While, in the multinomial
approach, the number of contingency tables to consider increases exponentially
with the size of the alphabet ($K$), in our framework we only consider tables of
a constant size $2 \times 2$. We thus reduce the complexity of the test.

4 Evaluation in the Context of Noisy Data

We compare, in this section, automata inferred with the test based on the
Hoeffding bound, those obtained with the multinomial approach and those obtained
with the proportion-based test, in two types of situations. The first one deals
with cases where the target automaton is a priori known. In this case we can
measure the distance between the inferred automata and the target automaton.
However, the target is not always known. In that case, we evaluate the merging
rules in another series of experiments using a perplexity measure. This criterion
assesses the relevance of the model on a test sample. In order to show the
effectiveness of our approach, experiments were done on two types of data, strings
and trees.



4.1 Evaluation Criteria 



Distance from the target automaton: [10] defined distances between two
hidden Markov models by introducing the co-emission probability, that is the
probability that two independent models generate the same string. The co-emission
probability of two stochastic automata $M_1$ and $M_2$ is denoted $A(M_1, M_2)$ and
defined as follows:

$$A(M_1, M_2) = \sum_{s \in \Sigma^*} P_{M_1}(s) \cdot P_{M_2}(s)$$

where $P_{M_i}(s)$ is the probability of $s$ given the model $M_i$. The co-emission
probability allows us to define a distance $D_A$ between two automata $M_1$ and $M_2$:

$$D_A(M_1, M_2) = \arccos\left( \frac{A(M_1, M_2)}{\sqrt{A(M_1, M_1) \cdot A(M_2, M_2)}} \right)$$

$D_A(M_1, M_2)$ can be interpreted as the measure of the angle between two vectors
representing the automata $M_1$, $M_2$ in a space where the base is the set of strings
of $\Sigma^*$.
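
To make the angle interpretation concrete, here is a minimal Python sketch of the distance for distributions with finite support, given as dictionaries mapping strings to probabilities. This representation and the function names are assumptions made for illustration; for real stochastic automata the sum ranges over all strings and has to be computed on the automata themselves, as described in [10].

```python
import math


def co_emission(dist1, dist2):
    """Probability that two independent models emit the same string."""
    return sum(prob * dist2.get(s, 0.0) for s, prob in dist1.items())


def angle_distance(dist1, dist2):
    """Angle between the two distributions seen as vectors indexed by strings."""
    cosine = co_emission(dist1, dist2) / math.sqrt(
        co_emission(dist1, dist1) * co_emission(dist2, dist2)
    )
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Two identical distributions are at distance 0, and two distributions with disjoint supports are at distance $\pi/2$.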



Perplexity measure: When the target automaton is not known, the quality
of an inferred model $M$ can be evaluated by the average likelihood on a set of
strings $S$ relatively to the distribution defined by $M$:

$$LL = -\frac{1}{|S|} \sum_{j=1}^{|S|} \log P_M(s_j)$$

where $P_M(s_j)$ defines the probability of the $j$-th string of $S$ according to $M$. A
perfect model can predict each element of the sample with a probability equal to
one, and so $LL = 0$. In a general way we consider the perplexity of the test set,
which is defined by $PP = 2^{LL}$. A minimal perplexity ($PP = 1$) is reached when
the model can predict each element of the test sample. Therefore we consider
that a model is more predictive than another if its perplexity is lower.
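
The perplexity computation can be sketched in Python as below. The sketch assumes that the logarithm is taken in base 2, so that $PP = 2^{LL}$ equals 1 for a perfect model, and it takes the model as any callable returning the probability of a string. The name `perplexity` and the choice of returning infinity when a test string has probability zero are illustrative assumptions, not part of the original protocol.

```python
import math


def perplexity(model_probability, test_strings):
    """Perplexity 2**LL of a model on a test sample.

    model_probability: callable mapping a string to its probability under M.
    test_strings:      non-empty list of test strings.
    """
    if not test_strings:
        raise ValueError("the test sample must not be empty")
    total = 0.0
    for s in test_strings:
        p = model_probability(s)
        if p <= 0.0:
            return math.inf  # an unpredicted string makes LL unbounded
        total += math.log2(p)
    average_log_loss = -total / len(test_strings)  # LL in the text
    return 2.0 ** average_log_loss
```

For instance, a model assigning probability 0.25 to every test string has a perplexity of 4, while a model assigning probability 1 to each of them reaches the minimum of 1.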

4.2 Experimentations on Strings 

Recall that our objective is to study the behavior of the three merging rules in
the context of noisy data. To corrupt our training file, we replace a proportion $\gamma$
(from 0.01 to 0.30) of letters of the training strings by a different letter randomly
chosen in the alphabet. For each level of noise, we use several $\alpha$ parameters from
0.0001 to 0.1. The results presented in this section correspond, for each approach,
to the optimal $\alpha$, that is the one which provides the smallest evaluation measure.
Since we use different levels of noise, the results are presented for each dataset by
the mean ± the standard deviation. We test the significance of our results using
a Student paired t-test with a first-order risk of 5%. In the presentation of our
results, those concerning the Hoeffding test are labeled H, those for the proportion
test P, and those for the multinomial approach M. We indicate
results obtained with Da for the distance and Pe for the perplexity. The column
Sig indicates the significance of the results.

Base            Size     H               P              M              Sig
Reber Da        3000     0.20 ± 0.153    0.16 ± 0.12    [illegible]    yes
Reber Pe        3000     1.76 ± 0.14     1.74 ± 0.13    1.75 ± 0.13    yes
ATIS Pe         ~7000    92.4 ± 11.58    62.7 ± 6.25    64.4 ± 9.49    yes
Agaricus + Pe   4208     2.23 ± 0.80     1.86 ± 0.37    1.92 ± 0.48    yes
Agaricus - Pe   3918     2.64 ± 1.21     2.06 ± 0.52    2.13 ± 0.60    yes
Badges + Pe     210      24.6 ± 2.51     22.3 ± 2.19    20.0 ± 2.95    yes
Badges - Pe     120      27.3 ± 2.6      24.3 ± 2.31    20.5 ± 3.11    yes
Promoters + Pe  56       3.80 ± 0.07     3.93 ± 0.05    3.91 ± 0.16    no for P vs M
Promoters - Pe  56       2.61 ± 0.79     2.79 ± 0.62    2.47 ± 0.96    yes

Fig. 4. Results on databases of strings. Yes in the column Sig means that all the
deviations between H and P, H and M and P and M are significant; otherwise we
indicate which deviation is not significant
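
The corruption of the training strings described in this subsection can be sketched in Python as follows. The source only states that a proportion $\gamma$ of the letters is replaced by a different letter chosen at random; drawing exactly `round(gamma * N)` positions without replacement over the whole training file is an assumption of this sketch, as are the function and parameter names.

```python
import random


def corrupt(strings, alphabet, gamma, rng):
    """Replace a proportion gamma of all letters by a different random letter.

    strings:  list of training strings.
    alphabet: iterable of letters, with at least two distinct letters.
    gamma:    noise level, between 0 and 1.
    rng:      a random.Random instance, so that runs are reproducible.
    """
    letters = list(alphabet)
    if len(set(letters)) < 2:
        raise ValueError("the alphabet needs at least two distinct letters")
    positions = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    chosen = rng.sample(positions, round(gamma * len(positions)))
    noisy = [list(s) for s in strings]
    for i, j in chosen:
        # Only letters different from the current one are candidates.
        candidates = [a for a in letters if a != noisy[i][j]]
        noisy[i][j] = rng.choice(candidates)
    return ["".join(s) for s in noisy]
```

A call such as `corrupt(training, "ab", 0.1, random.Random(0))` leaves the lengths of the strings unchanged and modifies one letter in ten.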

We use a first database for which the target automaton is a priori known. This
one represents the Reber grammar [11]. When the target is unknown, we assume
that a training set and a test set are available. Only the first one contains noisy
data. We evaluate the perplexity measure on the test set. We use here eight databases: a
sample generated from the Reber grammar; the ATIS database [12]; and three
databases of the UCI repository [13]: Agaricus, Badges and Promoters. For these
three bases, we consider positive and negative examples as two different concepts
to learn. We use a 5-fold cross-validation procedure for all the databases, except
for the ATIS one which already contains a training and a test set, and for which
we use different sizes of the training set (from 1000 to 13044).

The results of the experiments are summarized in Figure 4. Globally, and
independently of the complexity costs, which are highly in favor of our test,
the merging rules based on our proportion test and on the multinomial test
provide better results than Alergia, except for Promoters. This result can be
explained by the relatively small size of the sample. Globally, the multinomial
test works better than our approach on small datasets (Badges, Promoters); this
fact confirms the original motivation of MAlergia. However, when the size of the
training set grows, the proportion-based test is better (Reber, ATIS, Agaricus).
Considering the level of noise, we noted that the results are highly in favor of
our approach, particularly when the noise is higher than 8%. This behavior on
the database Agaricus is shown in Figure 5. Note that the difference between
the two approaches increases with the level of noise.

[Figure 5: two panels, Agaricus + and Agaricus -, plotting perplexity against the noise level (0 to 0.3)]

Fig. 5. Behavior of the merging rules on Agaricus w.r.t. different levels of noise

4.3 Experimentations on Trees 

Since the interest in tree-structured data is increasing, notably because of
their wide availability on the web, we also propose to evaluate our extension of
Alergia to stochastic tree automata [14] (note that we consider bottom-up tree
automata). The multinomial approach is not compared here because its adaptation
to bottom-up tree automata is not trivial.

Stochastic Tree Automata (STA): Tree automata [15] define a regular
language on trees as a PDFA defines a regular language on strings. Stochastic tree
automata are an extension of tree automata, defining a probability
distribution on the tree language defined by the automaton. We use an extension of
these automata taking into account the notion of type: stochastic many-sorted
tree automata defined on a signature. We do not detail here these automata
and their learning method. The interested reader can refer to [14, 16]. We only
point out that a learned stochastic tree automaton defines a probability
distribution on the trees recognized by the automaton. In the context of trees, we
change a proportion $\gamma$ of leaves in order to corrupt the learning set.

Experiments: We use three target grammars, one concerning stacks of objects,
one on boolean expressions and another artificial dataset Art2. From each
grammar we generate a sample of trees. We keep the same protocol as presented for
experiments on strings. For cases where the target automaton is unknown, we use
five datasets. We take a sample from each of the three previous tree grammars.
Then we also use the database exploited for the PKDD’02 discovery challenge¹
(converted into trees as described in [17]). Finally, we treat the database Student
Loan of the UCI repository, converting Prolog facts into trees as described in [18].
The results of the two series of experiments are presented in Figure 6.
Experiments on trees confirm the results observed on strings. Automata obtained
using the proportion test are better, with a lower standard deviation, than those
inferred with the test based on the Hoeffding bound.

¹ http://lisp.vse.cz/challenge/ecmlpkdd2002/





Improvement of the State Merging Rule on Noisy Data






Base             Size    H                P                Sig
Stacks Da        3000    0.241 ± 0.164    0.225 ± 0.17     yes
Art2 Da          3000    0.555 ± 0.138    0.190 ± 0.1      yes
Bool. Da         5000    0.1 ± 0.049      0.096 ± 0.046    yes
Stacks Pe        3000    1.85 ± 0.056     1.78 ± 0.063     yes
Art2 Pe          3000    3.68 ± 0.45      3.21 ± 0.21      yes
Bool. Pe         4000    2.60 ± 0.026     2.45 ± 0.01      yes
PKDD'02 Pe       4178    6.90 ± 1.99      1.94 ± 0.14      yes
Student Loan Pe  800     5.09 ± 1.48      2.88 ± 0.26      yes

Fig. 6. Results for trees on the 8 databases



5 Conclusion 

In this paper, we addressed the problem of dealing with noise in probabilistic 
grammatical inference. As far as we know, this problem has never been studied 
but seems very important because of the wide range of applications it is related 
to. Since the main objective in the probabilistic grammatical inference frame- 
work is to correctly estimate the probability distribution of the examples, we 
brought out the paradoxical fact that larger automata deal better with noise 
than more general (smaller) ones. We studied this behavior in the context of 
state merging algorithms and gave the intuitive idea that a bad merging, due to 
the presence of noise, could lead to a very bad estimation of the target distri- 
bution. Consequently we propose to use a restrictive statistical test during the 
inference process. Practically, we have proposed to replace the initial statistical 
test of the Alergia algorithm by a more restrictive one based on proportion 
comparison. We have proved its restrictiveness and shown its benefit in the
context of noisy data, on both artificial and real datasets.

While our approach deals better with noise, we have empirically noticed, in
noise-free situations, that the results are quite similar to those of Alergia.
We have also compared our test with the multinomial approach used in
MAlergia. Our proportion-based test is not only relevant, in terms of complexity
and perplexity, on small and large datasets, but also provides better results for
high levels of noise.

We are currently working on theoretical aspects of our work. We aim at 
proving that the acceptance of a bad merging, especially in the context of noisy 
data, implies a larger deviation from the target distribution than its rejection.

Abstract. The design of a Multi-Agent System (MAS) to perform well
on a collective task is non-trivial. Straightforward application of learning
in a MAS can lead to suboptimal solutions as agents compete or
interfere. The collective INtelligence (COIN) framework of Wolpert et al.
proposes an engineering solution for MASs where agents learn to focus
on actions which support a common task. As a case study, we
investigate the performance of COIN for representative token retrieval
problems found to be difficult for agents using classic Reinforcement Learning
(RL). We further investigate several techniques from RL (model-based
learning, Q($\lambda$)) to scale application of the COIN framework. Lastly, the
COIN framework is extended to improve performance for sequences of
actions.



1 Introduction 

As argued by Wellman [14,15], a computational problem can be considered as a 
resource allocation problem. Borrowing from the insights of economics, it is how- 
ever becoming increasingly clear that few concepts for resource allocation scale 
well with increasing complexity of the problem domain. In particular, centralized 
allocation planning can quickly reach a point where the design of satisfying solu- 
tions becomes complex and intractable. Conceptually, an attractive option is to 
devise a distributed system where different parts of the system each contribute 
to the solution for the problem. Embodied in a so-called distributed Multi-Agent
System (MAS), the aim is thus to elicit “emergent” behavior from a collection 
of individual agents that each solve a part of the problem. 

This emergent behavior relies implicitly on the notion that the usefulness of 
the system is expected to increase as the individual agents optimize their behav- 
ior. A weak point of such systems has however long been the typical bottom- 
up type of approach: researchers first build an intuitively reasonable system of 
agents and then use heuristics and tuned system parameters such that - hope- 
fully - the desired type of behavior emerges from running the system. Only 
recently has there been work on more top-down type of approaches to establish 
the conditions for MASs such that they are most likely to exhibit good emergent 
behavior [1,4,2].

In typical problem settings, individual agents in the MAS each contribute to some
part of the collective through their private actions. The joint actions of all agents
derive some reward from the outside world. To enable local learning, this
reward has to be divided amongst the individual agents, where each agent aims to
increase its received reward by some form of learning. However, unless special
care is taken as to how this reward is shared, there is a risk that agents in the
collective work at cross-purposes. For example, agents can reach sub-optimal
solutions by competing for scarce resources or by inefficient task distribution
among the agents as they each only consider their own goals (e.g. a Tragedy of
the Commons [3]).

The collective INtelligence (COIN) framework, as introduced by Wolpert
et al., suggests how to engineer (or modify) the rewards an agent receives for
its actions (and to which it adapts to optimize) in private utility functions.
Optimization of each agent’s private utility here leads to increasingly effective
emergent behavior of the collective, while discouraging agents from working at
cross-purposes.

In particular, the work by Wolpert et al. explores the conditions sufficient for 
effective emergent behavior for a collective of independent agents, each employing 
“sufficiently powerful” Reinforcement Learning (RL) for optimizing their private 
utility. These conditions relate to (i) the learnability of the problem each agent 
faces, as obtained through each individual agent’s private utility function, (ii) 
the relative “alignment” of the agents’ private utility functions with the utility 
function of the collective (the world utility), and lastly (iii) the learnability of 
the problem. Whereas the latter factor depends on the considered problem, the 
first two in COIN are translated into conditions on how to shape the private 
utility functions of the agents such that the world utility is increased when the 
agents improve their private utility. 

Wolpert et al. have derived private utility functions that perform well on 
the above first two conditions. The effectiveness of this top-down approach and 
their developed utilities are demonstrated by applying the COIN framework to a 
number of example problems: network routing [20], increasingly difficult versions 
of the El Farol Bar problem [17], and Braess’ paradox [11]. The COIN approach
proved to be very effective for learning these problems in a distributed system. 
In particular, the systems exhibited excellent scaling properties. Compared to 
optimal solutions, it is observed that a system like COIN becomes relatively 
better as the problem is scaled up [17]. 

In recent work [10], the COIN framework has been applied to problems where
different “single agent” RL algorithms are traditionally tested: grid-based world
exploration games [10,8]. In this problem domain, agents move on a grid-like
world where their aim is to collect tokens representing localized rewards as
efficiently as possible (e.g. [8]). For a Multi-Agent System, the challenge is to find
sequences of actions for each individual agent such that their joint sequences of
actions optimize some predetermined utility of the collective. The main result of
[10], in line with earlier work, is that the derived utility functions as advocated
in the COIN framework significantly outperform standard solutions for using RL
in collectives of agents.

We observe that in [10] the used RL algorithm, Q-learning, is the same as 
used in previous work on COIN. However, for learning sequences of actions 
with RL, there are substantially more powerful methods which we adapt for the 
COIN framework. We further report some modifications to these RL methods 
to address specific issues that arise in the COIN framework. We find that using 
these methods, our enhanced COIN approach yields more optimal exploration 
while converging more quickly. 

We start from our replication efforts of the grid-world problem of [10]. We 
report an anomaly in that a collection of selfish, greedy agents proved to be 
performing similarly to the more elaborate and computationally intensive COIN 
approach. To find out whether this issue was isolated to the particular grid-world 
example chosen, we designed grid-worlds that require more coordination among 
the agents. We then found that in those cases COIN does provide significant 
improvements compared to simple, greedy agents. 

This document is structured as follows. In Section 2, we describe the COIN 
framework. In Section 3 we report on our reproduction of [10]. In Section 4 we 
present problems that require more coordinated joint actions of a MAS. We also 
introduce an extension for COIN for sequences of actions. In Section 5, we adapt 
a number of more advanced RL methods for learning sequences of actions to the 
COIN framework, and report on the performance improvements. In Section 6 
we discuss future work and conclude. 



2 Background: Collective INtelligence 

In this Section, we briefly outline the theory of COIN as developed by Wolpert 
et al. More elaborate details can be found in [21,17,18]. Broadly speaking, COIN 
defines the conditions that an agent’s private utility function has to meet to 
increase the probability that learning to optimize this function leads to increased 
performance of the collective of agents. Thus, the challenge is to define a suitable 
private utility function for the individual agents, given the performance of the 
collective. 

Formally, let $\zeta$ be the joint moves of all agents. A function $G(\zeta)$ provides
the utility of the collective system, the world utility, for a given $\zeta$. The goal is
to find a $\zeta$ that maximizes $G(\zeta)$. Each individual agent $\eta$ has a private utility
function that relates the reward obtained by the collective to the reward that
the individual agent collects. Each agent will act such as to improve its own
payoff. The challenge of designing the collective system is to find private utility
functions such that when individual agents optimize their payoff, this leads to
increasing world utility $G$, while the private function of each agent is at the same
time also easily learnable (i.e. has a high signal-to-noise ratio, an issue usually
not considered in traditional mechanism design).

Following a mathematical description of this issue, Wolpert et al. propose
the Wonderful Life Utility (WLU) as a private utility function that is both
learnable and aligned with $G$, and that can also be easily calculated. In a
collective system consisting of multiple agents collectively collecting rewards (tokens)
on a grid, as discussed in more detail in Section 3, the WLU for one agent $\eta$ at
time $t$ with respect to the collective is ([10]):

$$WLU^0_{\eta,t}(\zeta) = GR_t(\zeta) - \left( T(L_{\hat{\eta},<t+1}, \Theta) - T(L_{\hat{\eta},<t}, \Theta) \right) \qquad (1)$$



where:

— $\zeta$ is the joint moves of all the agents.

— $L$ is the location matrix of the agents over time. And

• $L_{\eta}$ is the location of agent $\eta$ for all the time steps.

• $L_{\eta,t}$ is the location of agent $\eta$ at time step $t$.

• $L_{\eta,<t}$ are the locations of agent $\eta$ at earlier time steps.

• $L_{\hat{\eta}}$ are the locations of agents other than $\eta$.

— $T(L, \Theta)$ returns the value of the tokens received from the location matrix¹.

• $\Theta$ is the location of the initial tokens.

• $T(L, \Theta) = \sum_{x,y} \Theta_{x,y} \min(1, L_{x,y})$, i.e. the tokens picked up are the visited
tokens, but no more than once.

— $GR_t(\zeta) = T(L_{<t+1}, \Theta) - T(L_{<t}, \Theta)$, i.e. the value of all the tokens picked up
at time step $t$.

Hence $WLU^0_{\eta,t}(\zeta)$ for agent $\eta$ at time step $t$ is equal to the value of all the
tokens picked up by all the agents for that step minus the value of the tokens
picked up by the other agents $\hat{\eta}$ at time step $t$. If agent $\eta$ picks up a token $\tau$ at
time step $t$, which is not picked up by the other agents, then $\eta$ receives a reward
of $T(\tau)$. If this token is however picked up by any of the other agents at time
step $t$, then the first term $GR_t$ of Equation 1 is unchanged while the second term
drops with the value of $\tau$. Agent $\eta$ then receives a penalty $-T(\tau)$ for competing
for a token targeted by one of the other agents $\hat{\eta}$.
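
The following Python sketch evaluates the utility on a toy example. It assumes that Equation 1 reads $WLU_{\eta,t}(\zeta) = GR_t(\zeta) - (T(L_{\hat{\eta},<t+1}, \Theta) - T(L_{\hat{\eta},<t}, \Theta))$ and that $T$ sums each token at most once over the visited cells. The data layout (one list of visited cells per agent, tokens as a dictionary from cells to values) and all function names are hypothetical choices for this illustration.

```python
from collections import Counter


def visit_counts(paths, agents, before):
    """Location matrix restricted to `agents` and to time steps < before."""
    counts = Counter()
    for agent in agents:
        for cell in paths[agent][:before]:
            counts[cell] += 1
    return counts


def tokens_collected(counts, tokens):
    """T(L, Theta): every visited token is counted, but no more than once."""
    return sum(value * min(1, counts[cell]) for cell, value in tokens.items())


def gained_at(paths, agents, tokens, t):
    """Value of the tokens picked up at time step t by the given agents."""
    return (tokens_collected(visit_counts(paths, agents, t + 1), tokens)
            - tokens_collected(visit_counts(paths, agents, t), tokens))


def global_reward(paths, tokens, t):
    """GR_t: value of all the tokens picked up by all agents at step t."""
    return gained_at(paths, list(paths), tokens, t)


def wlu(paths, tokens, agent, t):
    """Global reward at step t minus what the other agents gain without `agent`."""
    others = [a for a in paths if a != agent]
    return global_reward(paths, tokens, t) - gained_at(paths, others, tokens, t)


# Toy scenario: "eta" takes the token at (0, 1) at step 0, and the other
# agent reaches the same cell one step later.
tokens = {(0, 1): 1.0, (2, 2): 0.5}
paths = {"eta": [(0, 1), (0, 2)], "other": [(1, 1), (0, 1)]}
print(wlu(paths, tokens, "eta", 0), wlu(paths, tokens, "eta", 1))
```

In this scenario the agent is credited with the value of the token at step 0 and is charged the same value at step 1, when the other agent arrives at a cell whose token is already gone.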

Compared to the WLU function, other payoff functions have been considered 
in the literature for distributed Multi-Agent Systems: the Team Game utility 
function (TG), where the world-utility is equally divided over all participating 
agents, or the Selfish Utility (SU), where each agent only considers the reward 
that it itself collects through its actions. These two common alternatives are 
extreme examples. The TG utility can suffer from poor learnability, as for larger 
collectives it becomes very difficult for each agent to discern what contribution 
is made (low signal-to-noise ratio), and the SU suffers from potentially poor
alignment with the world-utility. The superiority of the use of private WLU func- 
tions with local reinforcement learning has been shown for a variety of problems 
[19,17,11]. In the terminology of the COIN framework, the WLU is factored and 
aligned with the world utility. The Aristocratic Utility (AU) is not treated in 
this work as [10] observes comparable performance for the AU and WLU while 
the former is significantly more difficult to implement. 

¹ [10] uses V instead of T, which we use so as not to confuse it with the V used as
the valuation function in Q-learning.

for(int x=0; x < n; x=x+1)
  for(int y=0; y < n; y=y+1) {
    // token value grows towards the corner of the grid
    tokens[x][y] = 1.2*(x+y-n)/(1.0*n);
    // discard the low-valued tokens
    if(tokens[x][y] < 0.4)
      tokens[x][y] = 0; }
// two extra tokens of value 1 close to the center
tokens[n/2][(n/2)-1] = 1.0;
tokens[(n/2)+1][(n/2)-1] = 1.0;

(b) The algorithm
Fig. 1. The original problem of [10]

We compute the WLU for an agent $\eta$ by first letting all agents except $\eta$,
i.e. $\hat{\eta}$, make their moves and only then moving $\eta$. For $N$ moves, the agents $\hat{\eta}$
generate grids $grid_0$ to $grid_{N-1}$, where $grid_t$ documents the tokens picked up by
agents $\hat{\eta}$ at time step $t$ and the rewards which can be experienced by $\eta$ for its
moves. The agents $\hat{\eta}$ start from grid $grid_0$ filled with the initial tokens and with
all grids $grid_{t>0}$ initially empty. At time $t$, agents $\hat{\eta}$ pick up tokens from $grid_t$
at their current locations. A penalty (that is: the negative of the value of the
token) is then substituted for this token in $grid_t$. The grid $grid_{t+1}$ is then filled
with (a copy of) the remaining tokens of $grid_t$ prior to the moves of agents $\hat{\eta}$ at
time step $t + 1$. Agent $\eta$ then starts at the modified grid $grid_0$ after the moves
of the agents $\hat{\eta}$ are completed. Note that a token picked up by $\eta$ at time step $t$ is
removed from $grid_{>t}$; a token can only be picked up once.

3 Learning Joint Sequences of Actions



In this Section we present our findings when reproducing [10], with additional 
details from [6]. Agents in [10] jointly have to learn to explore a grid and ef- 
ficiently retrieve the available tokens. The agents in one step move either up, 
down, right, or left. The order of movement for each of the agents is identical 
in each turn. The tokens on the grid are as defined in Algorithm 1 for an n × n
grid where we consider the case from [10] where n = 10. A cluster of tokens 
increasing in value towards the edge of the grid is formed in a corner of the grid. 
Two extra tokens of value 1 are then added close to the center of the grid. This 
gives the grid of Figure 1(a) where x marks the initial location of the agents at 
[n/2,n/2]. 

Each individual agent uses Q-learning [8] as its RL algorithm. A learner’s
input space consists of the location of the agent in the grid and the action
space consists of the four directions in which the agent can move. The policy $\pi$ of
an agent in [10] is stochastic according to a softmax function; in the policy, a
random action $a_t$ is chosen for state $s$ and constant $c$ (set at 50) with normalized
chance in [0, 1] of $\frac{c^{Q(s,a)}}{\sum_{a'} c^{Q(s,a')}}$. The discount factor $\gamma$ is set to 0.95. The learning
rate $\alpha_t$ at time step $t$ for a state $s$ is taken as $\alpha_t = \frac{1}{1 + 0.0002 \cdot visits_t(s)}$, where
$visits_t(s)$ is the number of times the state $s$ has been visited during a learning
step ([6]). The decreasing value of $\alpha$ serves to induce agents to initially explore,
and then gradually fine-tune their behavior to maximize their utility as $\alpha$ drops.

Fig. 2. Reproduced results
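
The action selection and the decaying learning rate can be sketched in Python as below. The sketch assumes that the policy weights each action by $c^{Q(s,a)}$ and normalizes over the available actions, and that the learning rate is $1 / (1 + 0.0002 \cdot visits)$, which agrees with the values 0.2, 0.1 and 0.05 reported for 20,000, 45,000 and 95,000 visits. The function names and the dictionary of Q-values per action are hypothetical.

```python
import random


def learning_rate(visits, decay=0.0002):
    """Dynamic learning rate for a state visited `visits` times."""
    return 1.0 / (1.0 + decay * visits)


def action_probabilities(q_values, c=50.0):
    """Normalized chance c**Q(s, a) / sum over a' of c**Q(s, a')."""
    top = max(q_values.values())
    # Subtracting the largest Q-value leaves the ratios unchanged and
    # avoids overflow when Q-values are large.
    weights = {a: c ** (q - top) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose_action(q_values, rng, c=50.0):
    """Draw one action at random according to the probabilities above."""
    probabilities = action_probabilities(q_values, c)
    actions = list(probabilities)
    weights = [probabilities[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

With `c` set at 50, a difference of 1 between two Q-values makes the better action 50 times more likely than the other, so the policy quickly becomes close to greedy once the Q-values separate.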

In Figure 2 we present the learning curves for the problem setting of [10]. With
the dynamic $\alpha$ of [10] we reproduced the low utility for the SU of ≈ 0.4. However,
we found that for a fixed $\alpha$ of 0.1 using the SU, the MAS achieved a fitness of ≈
0.8 in 300 epochs. For a small $\alpha$, the selfish agents are able to focus gradually on
collecting the tokens even though they directly compete. The large concentration
of tokens in the corner of the grid acts as an attractor for these agents. In the
further experiments we present results for a default value of $\alpha$ of 0.1 due to
generally poor results for a dynamic $\alpha$.² In Figure 2, we also show the results
which correspond to findings in [10] for the WLU in a typical best case. Poorer
solutions were however found when averaging over 100 runs.

Our finding suggests that the problem considered in [10] is not as well
suited to demonstrating the additional distributed coordination capabilities of
the COIN framework as was claimed. We therefore study more general token
schemes where coordination of the joint actions is a prerequisite for high
performance. In the further experiments we also randomize the order of movement
of the agents at each step to better simulate a realistic MAS.

4 Coordinated Grid-World Problems

Given the good performance of the SU in Section 3, we designed a number
of token-retrieval problems that in particular play to the weaknesses of selfish
agents or have a low signal-to-noise ratio for learning. For both problems, a
reasonable utility is achieved for a wide selection of parameters using the
selfish utility or the team game, but the challenge is in achieving near-optimal
performance. We present two interesting examples in Figures 3(a) and 3(b),
where x marks the start locations of the agents.

Footnote (to the dynamic $\alpha$ of Section 3): This generally poor
performance can be caused by the slow drop in the value of $\alpha$ with the
number of visits to a state, resulting in a high $\alpha$ for the initial
epochs, which causes the system to react strongly to rewards ($\alpha$ drops to
0.2, 0.1, and 0.05 after 20,000, 45,000, and 95,000 visits, respectively).
Thus, a dynamic $\alpha$ allows for easy propagation of reward over regions
with little feedback in the reinforcement signal, but can also lead to strong
fluctuations in behavior that are not suitable for difficult scenarios where
delicate tuning of behavior is required.

Fig. 3. The two new grids: (a) completely filled; (b) tokens on the edge

In the problem of Figure 3(a), the whole grid (of size 10 × 10) is filled with
tokens of value 1, except for the initial location of the agents. Each agent is 
allowed 10 moves. Thus, the agents have to disperse in order to maximize the 
number of visited nodes on the grid. The world utility is defined as the maximum 
number of tokens the agents can collectively pick up in their moves. Such a grid 
is representative of a situation where agents have little prior information of the 
world and have to devise a strategy for maximum exploration. This problem is 
difficult to solve due to the low signal-to-noise ratio with uniform reward for all 
visited locations. We increase the action-space by allowing the agents to move 
diagonally in order for them to better be able to disperse from their clustered 
initial position. 

For the problem of Figure 3(b), differently valued tokens are placed on the
edge of an 11 × 11 grid and the agents take five steps. Diagonal moves are also
possible.
The agents are hence able to pick up all tokens if they cooperate perfectly by 
each focusing on a distinct token. This problem is representative of a complex 
set of tasks which must all be completed by one of the agents, but the different 
tasks have varying priorities. A solution is difficult to learn as agents may focus 
exclusively on the high priority tokens and neglect to collect the cheap tokens 
needed for high performance. 

In the first coordination problem with eight agents, selfish agents using the
SU achieve a utility of close to 0.8 (Figure 4(a)). With the dynamic (high)
$\alpha$ of Section 3, a high fitness is quickly reached, but a higher eventual
fitness is reached using a fixed $\alpha$ of 0.1. A higher absolute fitness was
not found in a wide study of parameters. This problem is hence relatively
simple for the SU to solve partially, but the challenge is in achieving (near
to) full utility. The Team Game Utility (TG) achieved a comparable performance,
but likewise suffers from a low signal-to-noise ratio and is not able to improve
performance beyond the presented level.
Fig. 4. Two new problems: (a) a full grid; (b) a sparse grid

Agents using the WLU showed slightly better performance, with a final utility
of 0.82. This low result was found to be caused by loops in the paths of the
individual agents promoted by the COIN framework. Due to the softmax function
for choice of action of Section 3, an agent is averse to actions with negative
Q-values. An agent $\eta$ hence quickly learns to avoid penalties imposed by
the other agents $\hat{\eta}$. Agent $\eta$ then tends to find a good partial
path and revisit it, as a zero immediate reward (see footnote) for an action is
superior to receiving a penalty.

To alleviate this problem, we give an agent $\eta$ a penalty for revisiting a
state $s$ which has been visited at an earlier time step in the same epoch.
When agent $\eta$ visits grid $grid_t$ during computation of the WLU as defined
in Section 2, then instead of a picked-up token being removed from the later
grids $grid_{t'}$ ($t' > t$), a penalty of minus the token's value, $-T(\cdot)$,
is set on these grids $grid_{t'}$. With this approach, an agent learns not to
revisit an earlier part of its travels, as this also results in a penalty. For
the grid of Figure 3(a), the improved high performance with use of a penalty
for revisiting states is given in Figure 4(a). In the rest of the paper, the
agents using the WLU likewise pay a penalty for revisiting an earlier state, as
otherwise the performance of the COIN framework was found to be significantly
lower.

In the second problem of Figure 3(b), agents using the Selfish Utility function
are attracted to the high token values, making the collective perform poorly.
As can be seen in Figure 4(b), a maximum fitness of close to 0.5 is temporarily
achieved as the agents explore in search of good tokens. With increasing
competition for the high-value tokens, the positive reinforcement signal for
these targets decreases and the agents become indecisive. For the TG, a maximum
fitness (even after 50,000 epochs) of ≈ 0.7 is slowly reached, as the agents
are unable to effectively target a token due to the low signal-to-noise ratio.
However, when using the WLU with a penalty for revisiting states, the agents
are able to learn to pick up all the tokens (Figure 4(b)). A fitness of 0.5 is
quickly reached and, after an agent has chosen or won a token, the WLU issues
sufficiently consistent penalties to convince the competing agents to look
elsewhere on the grid.

Footnote: All tokens on this earlier part of the path have been retrieved.

Footnote: The penalty for revisiting a state may have to increase as the grid
becomes more crowded and the agent needs an incentive to explore beyond an
earlier successful route and across the negative penalties deposited by
neighboring agents.

Summarizing, in this section we have shown how the extended COIN framework
outperforms the SU and TG for two illustrative problems. Through the WLU,
agents in the COIN approach are able to solve a general distribution problem
where they overcome the low signal-to-noise ratio which limits the performance
of the SU and TG. The agents using the WLU are also able to solve a difficult
collaborative task where high priorities for a selection of tasks attract naive
learners at the expense of other tasks.

5 Scaling COIN 

The RL algorithm used in [10] is plain Q-learning. Effectively, the update of 
the value-function after a move only considers the immediate reward and a val- 
uation of the next state. It is well known that for agents optimizing grid-like 
world-exploration problems or learning sequences of actions, more effective RL 
algorithms have been developed (that take into account the expected future 
rewards of a sequence of moves). Wolpert et al. based their COIN framework 
on the presumption of individual agents using “sufficiently powerful” RL algo- 
rithms. In this Section, we explore several possibilities for using more powerful 
RL algorithms. 

5.1 Enhanced RL Techniques 

Watkins' Q($\lambda$): First, we applied Watkins' Q($\lambda$) learning [12,8]
in a COIN setting. Q($\lambda$) learning has been reported to substantially
improve learning results in single-agent applications. Through the use of
eligibility traces [12,8], a single agent can more efficiently propagate its
experienced reinforcement signals through its Q-values. In the COIN framework,
a penalty produced by an agent $\eta$ is likewise expected to be propagated
more efficiently by Q($\lambda$) over the paths of agents that interfere with
the activities of $\eta$.
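To make the role of the eligibility traces concrete, the following is a minimal sketch of one tabular Watkins' Q($\lambda$) update in its textbook form (as in [8]). It is an illustration only, not the authors' implementation: the function name, the dictionary-based tables, the accumulating-trace variant, and the default parameter values (including `lam`) are assumptions made for this sketch.

```python
from collections import defaultdict

def watkins_q_lambda_step(q, trace, s, a, r, s_next, a_next, n_actions,
                          alpha=0.1, gamma=0.95, lam=0.9):
    """Apply one Watkins' Q(lambda) update in place.

    q and trace map (state, action) pairs to floats (e.g. defaultdict(float)).
    a_next is the action the agent will actually take in s_next.
    """
    greedy_value = max(q[(s_next, b)] for b in range(n_actions))
    delta = r + gamma * greedy_value - q[(s, a)]   # one-step TD error
    trace[(s, a)] += 1.0                           # accumulating trace
    next_is_greedy = q[(s_next, a_next)] == greedy_value
    for key in list(trace):
        # Every recently visited pair shares in the TD error.
        q[key] += alpha * delta * trace[key]
        # Traces decay after a greedy action and are cut after an
        # exploratory one (the defining feature of Watkins' variant).
        trace[key] = gamma * lam * trace[key] if next_is_greedy else 0.0
```

In a COIN setting the reward `r` passed to this update would be the agent's WLU signal, so a penalty received late on a path is also credited to the earlier state-action pairs that led there.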

Within the WLU framework, we can devise an alternative to Q($\lambda$) in the
form of Temporal Propagation of Penalties. Temporal Penalty (TP) propagation
works as follows: in the WLU, a penalty is incurred by the learner if it picks
up a token at the same time step as one of the baseline agents would. Recall
that the reward at time step $t$ for agent $\eta$ in Section 2 is defined as
$WLU_{\eta,t}(\zeta)$. The reward (or penalty) for agent $\eta$ is determined
in an interwoven fashion with the moves made by the other agents in the same
time step. A consequence of this definition is that an agent $\eta$ is not
penalized if it picks up a token at time $t$ which one of the other agents
$\hat{\eta}$ is planning to pick up at a later time step $t_n > t$. We can,
however, temporally propagate a penalty for the snatching of any token by
$\eta$ from the other agents $\hat{\eta}$: let $S(L, \Theta)$ be the set of
tokens picked up during movement. Then, for
$S_{\hat{\eta}} = S(L_{\hat{\eta}}, \Theta)$, our modified reward
$TPWLU_{\eta,t}(\zeta)$ for agent $\eta$ at time step $t$ is defined as:

$$GR_{\hat{\eta}}(\zeta) + T(S(L_{\eta,t}, \Theta) \setminus S_{\hat{\eta}}) - T(S(L_{\eta,t}, \Theta) \cap S_{\hat{\eta}}). \qquad (2)$$
The above modified utility function induces an agent $\eta$ to consider all
future actions of the other agents $\hat{\eta}$, and not just those that
coincide for specific time steps. This can support a stronger cooperation
between the agents.

Additionally, convergence of learning for an agent $\eta$ can possibly be
enhanced by letting it learn relative to the other agents $\hat{\eta}$ not only
once per epoch, as defined in Section 2, but multiple times within one epoch:
Model-Based Planning. The potentially expensive calculation of
$WLU_{\eta,t}(\zeta)$ of Section 2 for each agent $\eta$ is then used several
times by $\eta$ to learn how to behave optimally relative to the other agents.
Agent $\eta$ traverses (copies of) the grids $grid_0$ to $grid_t$ not once, but
$n$ times during learning in one epoch. In [13], a similar model-based approach
is used, where agents learn according to a generalized DYNA-Q [8] architecture
by interleaving planning according to a learned world model with acting based
on this model.

5.2 Results 

As a case study for these more powerful RL algorithms within a COIN frame- 
work, we used the joint coordinated exploration problem of Figure 3(b). The 
tokens on the grid are however placed a distance of one from the edge, making 
the problem more illustrative by giving more opportunity for (discounted) feed- 
back from the received rewards (as there are more successful paths for an agent 
to pick up a token from its initial position).

As shown in Figure 5(a), Q($\lambda$) aids the WLU in finding a solution to the
problem. Convergence speed is improved significantly and full utility is
reached. We found a near-identical performance gain relative to Q($\lambda$)
for Temporal Propagation of Penalties. This similar gain is hypothesized to
have its roots in a similar propagation of rewards. For Q($\lambda$),
discounted penalties are carried over to bordering states in a traveled path,
whereas for temporal penalty propagation, penalties are carried over from
states that will be visited in the near future.

Figure 5(b) shows the improved convergence when using model-based planning.
Convergence is sped up considerably as the number of learning iterations for
one agent $\eta$ relative to the other agents is increased from one (standard)
to two to three. For larger problems, for example a larger grid, added
iterations further improved the convergence properties of the system.
Preliminary results for more complex scenarios indicate that differentiating
the frequency of model-based learning in accordance with the absolute strength
of the reward experienced by $\eta$ can further significantly speed up
convergence (a strong reward is an indication of whether the agent is on the
right track or should do something else entirely). By proportioning the
learning of $\eta$ to the experienced reward, $\eta$ can benefit more from
learning from the experienced rewards relative to the actions of $\hat{\eta}$.

6 Discussion and Conclusion 

In this paper we studied the Collective INtelligence (COIN) framework of
Wolpert et al. We reproduced [10] as a case study where agents use standard
Reinforcement Learning (RL) techniques to collectively pick up tokens on a grid.
We observed that for more complex problems the COIN framework is able to
solve difficult MAS problems where fine-grained coordination between the agents
is required, in contrast to multi-agent systems that use less advanced
decentralized coordination.

Fig. 5. Extended RL applications: (a) Q($\lambda$); (b) model-based learning

We enhanced the COIN formalism to avoid pathological situations due to
the nature of the WL utility function. In particular, we discounted actions
looping back to earlier action sequences to promote a unique path traveled by
an agent. This enhancement resulted in near-optimal fitness for difficult token
retrieval actions. Furthermore, we investigated the use of more powerful RL
techniques within the (enhanced) COIN framework. We explored the use of
Watkins' Q($\lambda$) learning, model-based learning, and the extended use of
penalties in COIN over sequences of actions. All three extensions led to
improved performance for the problems investigated and demonstrate methods for
further improving the performance of the COIN framework in larger, more complex
applications.

As future work we consider applying bootstrapping techniques from single-agent
RL to the COIN framework. RL in general can benefit significantly from directed
exploration ([5,9] and [16]). Sub-goal detection as in [7] can also greatly
speed up the learning of complex tasks. For example, in [7] an agent learns to
focus its learning on critical points in the task which form bottlenecks for
good overall performance. An open question is how the above work can be
integrated in the (extended) COIN framework for tasks with bottlenecks
occurring due to dynamic interactions in a MAS.
Abstract. Rademacher penalization is a modern technique for obtaining
data-dependent bounds on the generalization error of classifiers. It
would appear to be limited to relatively simple hypothesis classes because
of computational complexity issues. In this paper we nevertheless apply
Rademacher penalization to the practically important hypothesis class of
unrestricted decision trees by considering the prunings of a given decision
tree rather than the tree growing phase. Moreover, we generalize the
error-bounding approach from binary classification to multi-class
situations. Our empirical experiments indicate that the proposed new 
bounds clearly outperform earlier bounds for decision tree prunings and 
provide non-trivial error estimates on real-world data sets. 



1 Introduction 

Data-dependent bounds on the generalization error of classifiers are bridging
the gap that has existed between theoretical and empirical results since the
introduction of computational learning theory. They allow situation-specific
information to be taken into account, whereas distribution-independent results
need to hold for all imaginable situations. Using Rademacher complexity [1,2]
to bound the generalization error of a training error minimizing classifier is
a fairly new approach that has not yet been tested extensively in practice.

Rademacher penalization is in principle a general method applicable to any 
hypothesis class. However, in practice it does not seem amenable to complex hy- 
pothesis classes because the standard method for computing Rademacher penal- 
ties relies on the existence of an empirical risk minimization algorithm for the 
hypothesis class in question. The first practical experiments with Rademacher 
penalization used real intervals as the hypothesis class [3]. We have applied 
Rademacher penalization to two-level decision trees [4], which can be learned 
efficiently in the agnostic PAC model [5].

General decision tree growing algorithms are necessarily heuristic because of
the computational complexity of finding optimal decision trees [6]. Moreover,
the hypothesis class consisting of unrestricted decision trees is so vast that
traditional generalization error analysis techniques cannot provide non-trivial
bounds for it. Nevertheless, top-down induction of decision trees by, e.g.,
C4.5 [7] produces results that are very competitive in prediction accuracy with
better motivated approaches. We consider the usual two-phase process of decision
tree learning: after growing a tree, it is pruned in order to reduce its
dependency on the training data and to better reflect characteristics of future
data. Given the practical success of decision tree learning, prunings of an
induced decision tree have to be considered an expressive class of hypotheses.

We apply Rademacher penalization to general decision trees by considering, 
not the tree growing phase, but rather the pruning phase. The idea is to view 
decision tree pruning as empirical risk minimization in the hypothesis class con- 
sisting of all prunings of an induced decision tree. First a heuristic tree growing 
procedure is applied to training data to produce a decision tree. Then a pruning 
algorithm, for example the reduced error pruning (Rep) algorithm of Quinlan
[8], is applied to the grown tree and a set of pruning data. As Rep is known to be
an efficient empirical risk minimization algorithm for the class of prunings of a 
decision tree, it can be used to compute the Rademacher penalty for this hypoth- 
esis class. Thus, by viewing decision tree pruning as empirical risk minimization 
in a data-dependent hypothesis class, we can bound the generalization error of 
prunings by Rademacher penalization. We also extend this generalization error 
analysis framework to the multi-class setting. 

Our empirical experiments show that Rademacher penalization applied to 
prunings found by Rep provides reasonable generalization error bounds on real- 
world data sets. Although the bounds still overestimate the test set error, they 
are much tighter than the earlier distribution independent bounds for prunings. 

This paper is organized as follows. In Section 2 we recapitulate the main idea 
of data-dependent generalization error analysis. We concentrate on Rademacher 
penalization which we extend to cover the multi-class case. Section 3 concerns 
pruning of decision trees, reduced error pruning of decision trees being the main 
focus. Related pruning approaches are briefly reviewed in Section 4. Combining 
Rademacher complexity calculation and decision tree pruning is the topic of 
Section 5. An empirical evaluation of the proposed approach is presented in
Section 6 and, finally, Section 7 presents the concluding remarks of this study.



2 Rademacher Penalties 

Let $S = \{ (x_i, y_i) \mid i = 1, \ldots, n \}$ be a sample of $n$ examples
$(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, each of which is drawn
independently from some unknown probability distribution on
$\mathcal{X} \times \mathcal{Y}$. In the PAC and statistical learning settings
one usually assumes that the learning algorithm chooses its hypothesis
$h: \mathcal{X} \to \mathcal{Y}$ from some fixed hypothesis class
$\mathcal{H}$. Under this assumption generalization error analysis provides
theoretical results bounding the generalization error of hypotheses
$h \in \mathcal{H}$ that may depend on the sample, the learning algorithm, and
the properties of the hypothesis class. We consider the multi-class setting,
where $\mathcal{Y}$ may contain more than two labels.

Let $P$ be the unknown probability distribution according to which the
examples are drawn. The generalization error of a hypothesis $h$ is the
probability that a randomly drawn example $(x, y)$ is misclassified:

$$\epsilon_P(h) = P(h(x) \neq y).$$
The general goal of learning, of course, is to find a hypothesis with a small
generalization error. However, since the generalization error depends on $P$,
it cannot be computed directly based on the sample alone. We can try to
approximate the generalization error of $h$ by its training error on $n$
examples:

$$\hat{\epsilon}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i),$$

where $\ell$ is the 0/1 loss function for which $\ell(y, y') = 1$ if
$y \neq y'$ and 0 otherwise.

Empirical Risk Minimization (ERM) [9] is a principle that suggests choosing
the hypothesis $h \in \mathcal{H}$ with minimal training error. In relatively
small and simple hypothesis classes finding a minimum training error hypothesis
is computationally feasible. To guarantee that ERM yields hypotheses with small
generalization error, one can try to bound
$\sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)|$. Under the
assumption that the examples are independent and identically distributed
(i.i.d.), whenever $\mathcal{H}$ is not too complex, the difference between the
training error of the hypothesis $h$ on $n$ examples and its true
generalization error converges to 0 in probability as $n$ tends to infinity.

The most common approach to deriving generalization error bounds is based 
on the VC dimension of the hypothesis class. The problem with this approach 
is that it provides optimal results only in the worst case — when the underlying 
probability distribution is as bad as it can be. Thus, the generalization error 
bounds based on VC dimension tend to be overly pessimistic. Moreover, the VC 
dimension bounds are hard to extend to the multi-class setting. Data-dependent 
generalization error bounds, on the other hand, can be provably almost optimal 
for any given domain [1]. In the following we review the foundations of a recent 
promising approach to bounding the generalization error. 

A Rademacher random variable takes values $+1$ and $-1$ with probability 1/2
each. Let $r_1, r_2, \ldots, r_n$ be a sequence of Rademacher random variables
independent of each other and of the data $(x_1, y_1), \ldots, (x_n, y_n)$. The
Rademacher penalty of the hypothesis class $\mathcal{H}$ is defined as

$$R_n(\mathcal{H}) = \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} r_i \, \ell(h(x_i), y_i) \right|.$$

The following symmetrization inequality [10], which also covers the multi-class
setting, connects Rademacher penalties to generalization error analysis.

Theorem 1. The inequality

$$E\left[ \sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)| \right] \le 2\, E\left[ R_n(\mathcal{H}) \right]$$

holds for any distribution $P$, number of examples $n$, and hypothesis class
$\mathcal{H}$.

The random variables
$\sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)|$ and
$R_n(\mathcal{H})$ are sharply concentrated around their expectations [1]. The
concentration results are based on the following bounded difference inequality
of McDiarmid [11].
Lemma 1 (McDiarmid's inequality). Let $Z_1, \ldots, Z_n$ be independent random
variables taking their values in a set $A$. Let $f: A^n \to \mathbb{R}$ be a
function such that over all $z_1, \ldots, z_n, z_i' \in A$

$$\sup |f(z_1, \ldots, z_i, \ldots, z_n) - f(z_1, \ldots, z_i', \ldots, z_n)| \le c_i$$

for some constants $c_1, \ldots, c_n \in \mathbb{R}$. Then for all
$\epsilon > 0$

$$P\left( f(Z_1, \ldots, Z_n) - E[f(Z_1, \ldots, Z_n)] \ge \epsilon \right) \quad \text{and} \quad P\left( E[f(Z_1, \ldots, Z_n)] - f(Z_1, \ldots, Z_n) \ge \epsilon \right)$$

are upper bounded by $\exp\left(-2\epsilon^2 / \sum_{i=1}^{n} c_i^2\right)$.

Using McDiarmid's inequality one can bound the generalization error of
hypotheses using their training error and Rademacher penalty as follows.

Lemma 2. Let $h \in \mathcal{H}$ be arbitrary. Then with probability at least
$1 - \delta$

$$\epsilon_P(h) \le \hat{\epsilon}_n(h) + 2 R_n(\mathcal{H}) + 5\,\eta(\delta, n), \qquad (1)$$

where $\eta(\delta, n) = \sqrt{\ln(2/\delta) / (2n)}$ is a small error term that
goes to zero as the number of examples increases.

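Proof. Observe that replacing a pair $((x_i, y_i), r_i)$ consisting of an
example $(x_i, y_i)$ and a Rademacher random variable $r_i$ by any other pair
$((x_i', y_i'), r_i')$ may change the value of $R_n(\mathcal{H})$ by at most
$2/n$. Thus, Lemma 1 applied to the i.i.d. random variables
$((x_1, y_1), r_1), \ldots, ((x_n, y_n), r_n)$ and the function
$R_n(\mathcal{H})$ yields

$$P\left( R_n(\mathcal{H}) \le E[R_n(\mathcal{H})] - 2\eta(\delta, n) \right) \le \frac{\delta}{2}. \qquad (2)$$

Similarly, changing the value of any example $(x_i, y_i)$ can change the value
of $\sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)|$ by no more
than $1/n$. Thus, applying Lemma 1 again to
$(x_1, y_1), \ldots, (x_n, y_n)$ and
$\sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)|$ gives

$$P\left( \sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)| \ge E\left[ \sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)| \right] + \eta(\delta, n) \right) \le \frac{\delta}{2}. \qquad (3)$$

To bound the generalization error of a hypothesis $g \in \mathcal{H}$ observe
that

$$\epsilon_P(g) \le \hat{\epsilon}_n(g) + \sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)|.$$

Hence, by inequality (3), with probability at least $1 - \delta/2$

$$\epsilon_P(g) \le \hat{\epsilon}_n(g) + E\left[ \sup_{h \in \mathcal{H}} |\epsilon_P(h) - \hat{\epsilon}_n(h)| \right] + \eta(\delta, n) \le \hat{\epsilon}_n(g) + 2\, E[R_n(\mathcal{H})] + \eta(\delta, n),$$

where the second inequality follows from Theorem 1. Finally, applying inequality
(2) yields that with probability at least $1 - \delta$

$$\epsilon_P(g) \le \hat{\epsilon}_n(g) + 2 R_n(\mathcal{H}) + 5\,\eta(\delta, n).$$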
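The constants $2\eta$, $\eta$, and $5\eta$ in the proof follow directly from McDiarmid's inequality. The short derivation below spells out this arithmetic as a worked check; it adds nothing beyond the bounded differences $2/n$ and $1/n$ stated in the proof, and writes $\eta$ for $\eta(\delta, n)$.

```latex
\begin{align*}
% Rademacher penalty: bounded differences c_i = 2/n
\sum_{i=1}^{n} c_i^2 = \frac{4}{n}
  &\;\Rightarrow\; \exp\!\left(-\frac{2\epsilon^2}{4/n}\right)
   = \exp\!\left(-\frac{n\epsilon^2}{2}\right) = \frac{\delta}{2}
  \;\Rightarrow\; \epsilon = \sqrt{\frac{2\ln(2/\delta)}{n}} = 2\eta, \\
% Supremum of deviations: bounded differences c_i = 1/n
\sum_{i=1}^{n} c_i^2 = \frac{1}{n}
  &\;\Rightarrow\; \exp\!\left(-2n\epsilon^2\right) = \frac{\delta}{2}
  \;\Rightarrow\; \epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}} = \eta, \\
% Combining (3), Theorem 1 and (2) with a union bound over the two events
\epsilon_P(g)
  &\le \hat{\epsilon}_n(g) + 2\,E[R_n(\mathcal{H})] + \eta
   \le \hat{\epsilon}_n(g) + 2\bigl(R_n(\mathcal{H}) + 2\eta\bigr) + \eta
   = \hat{\epsilon}_n(g) + 2R_n(\mathcal{H}) + 5\eta .
\end{align*}
```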
The usefulness of inequality (1) stems from the fact that its right-hand side
depends only on the training sample and the Rademacher random variables but
not on $P$ directly. Hence, all the data needed to evaluate the generalization
error bound is available to the learning algorithm. Furthermore, Koltchinskii
[1] has shown that in the two-class situation the Rademacher penalty can be
computed by an empirical risk minimization algorithm applied to relabeled
training data. We now extend this method to the multi-class setting.
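Because the right-hand side of inequality (1) uses only quantities available at training time, evaluating the bound is a one-line computation once the training error and the Rademacher penalty are known. The sketch below illustrates this; the function names `eta` and `rademacher_bound` are hypothetical and chosen for this example only.

```python
import math

def eta(delta, n):
    """Error term eta(delta, n) = sqrt(ln(2/delta) / (2n)) of Lemma 2."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def rademacher_bound(train_error, penalty, delta, n):
    """Right-hand side of inequality (1): training error + 2 R_n + 5 eta."""
    return train_error + 2.0 * penalty + 5.0 * eta(delta, n)
```

For instance, with $\delta = 0.01$ and $n = 1000$ the term $5\eta(\delta, n)$ alone is roughly 0.26, so on small pruning sets the error term can dominate the bound.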

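The expression for $R_n(\mathcal{H})$ is first written as the maximum of two
suprema in order to remove the absolute value inside the original supremum:

$$R_n(\mathcal{H}) = \max\left\{ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} r_i \, \ell(h(x_i), y_i),\; \sup_{h \in \mathcal{H}} \left( -\frac{1}{n} \sum_{i=1}^{n} r_i \, \ell(h(x_i), y_i) \right) \right\}.$$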

The sum inside the supremum with positive sign is maximized by the hypothesis
$h_1$ that tries to correctly classify those and only those training examples
$(x_i, y_i)$ for which $r_i = -1$. To formalize this, we associate each class
$y \in \mathcal{Y}$ with a complement class label $\bar{y}$ that represents the
set of all classes but $y$. We denote the set of these complement classes by
$\bar{\mathcal{Y}}$ and extend the domain of the loss function $\ell$ to cover
pairs $(y, z) \in \mathcal{Y} \times \bar{\mathcal{Y}}$ by setting
$\ell(y, z) = 1$ if $z = \bar{y}$ and 0 otherwise. Using this notation, $h_1$
is the hypothesis that minimizes the empirical error with respect to a newly
labeled training set $\{ (x_i, z_i) \}_{i=1}^{n}$, where

$$z_i = \begin{cases} y_i & \text{if } r_i = -1; \\ \bar{y}_i & \text{otherwise.} \end{cases}$$

The case for the supremum with negative sign is similar.

Altogether, the computation of the Rademacher penalty entails the following
steps.

— Toss a fair coin $n$ times to obtain a realization of the Rademacher random
variable sequence $r_1, \ldots, r_n$.

— Change the label $y_i$ to $\bar{y}_i$ if and only if $r_i = +1$ to obtain a
new sequence of labels $z_1, \ldots, z_n$.

— Find functions $h_1, h_2 \in \mathcal{H}$ that minimize the empirical error
with respect to the sets of labels $z_i$ and $\bar{z}_i$, respectively. Here,
we follow the convention that $\bar{\bar{z}} = z$ for all
$z \in \mathcal{Y} \cup \bar{\mathcal{Y}}$.

— The Rademacher penalty is given by the maximum of
$|\{ i : r_i = +1 \}| / n - \hat{\epsilon}(h_1)$ and
$|\{ i : r_i = -1 \}| / n - \hat{\epsilon}(h_2)$, where the empirical errors
$\hat{\epsilon}(h_1)$ and $\hat{\epsilon}(h_2)$ are with respect to the labels
$z_i$ and $\bar{z}_i$, respectively.
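The steps above can be sketched in Python for a small finite hypothesis class, where ERM is done by brute force. This is an illustrative sketch under stated assumptions, not the procedure used in the experiments: complement labels are represented as tuples `("not", y)` (so ordinary class labels are assumed not to be such tuples), hypotheses are plain callables, and all function names are hypothetical. The helper `direct_penalty` evaluates the definition of the penalty directly and serves only as a cross-check.

```python
def is_complement(label):
    return isinstance(label, tuple) and len(label) == 2 and label[0] == "not"

def complement(label):
    """Map y to its complement label and back, so complement(complement(z)) == z."""
    return label[1] if is_complement(label) else ("not", label)

def loss(prediction, label):
    """0/1 loss extended to complement labels: predicting y is wrong for ('not', y)."""
    if is_complement(label):
        return 1 if prediction == label[1] else 0
    return 1 if prediction != label else 0

def empirical_error(h, xs, labels):
    return sum(loss(h(x), z) for x, z in zip(xs, labels)) / len(xs)

def erm(hypotheses, xs, labels):
    """Brute-force ERM over a finite hypothesis class (illustration only)."""
    return min(hypotheses, key=lambda h: empirical_error(h, xs, labels))

def rademacher_penalty(hypotheses, xs, ys, signs):
    """Penalty for one realization `signs` of the Rademacher variables (+1/-1)."""
    n = len(xs)
    z = [complement(y) if r == +1 else y for y, r in zip(ys, signs)]
    z_bar = [complement(label) for label in z]
    h1 = erm(hypotheses, xs, z)
    h2 = erm(hypotheses, xs, z_bar)
    plus = sum(1 for r in signs if r == +1) / n
    minus = sum(1 for r in signs if r == -1) / n
    return max(plus - empirical_error(h1, xs, z),
               minus - empirical_error(h2, xs, z_bar))

def direct_penalty(hypotheses, xs, ys, signs):
    """Definition of the penalty: sup over h of |(1/n) sum_i r_i * loss(h(x_i), y_i)|."""
    n = len(xs)
    return max(abs(sum(r * loss(h(x), y) for x, y, r in zip(xs, ys, signs))) / n
               for h in hypotheses)
```

The signs would normally be drawn at random, for example with `random.choice([-1, +1])` for each example; they are passed in explicitly here so that the two computations can be compared on the same realization.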

In the two-class setting, the set $\bar{y}$ of all classes but $y$,
$\mathcal{Y} \setminus \{ y \}$, is a singleton. Thus, changing class $y$ to
$\bar{y}$ amounts to flipping the class label. It follows that a normal ERM
algorithm can be used to find the hypotheses $h_1$ and $h_2$, and hence the
Rademacher penalty can be computed efficiently provided that there exists an
efficient ERM algorithm for the hypothesis class in question.

In the multi-class setting, however, a little more is required, since the sample
on which the empirical risk minimization is performed may contain labels from
$\bar{\mathcal{Y}}$ and the loss function differs from the standard 0/1 loss.
This, however, is not a problem with Rep or with T2, a decision tree learning
algorithm used in our earlier study, since both empirical risk minimization
algorithms can easily be adapted to handle this more general setting, as
explained in the next section for Rep and argued by Auer et al. [5] for T2.



3 Growing and Pruning Decision Trees 

A common approach in top-down induction of decision trees is to first grow a 
tree that fits the training data well and, then, prune it to reflect less the peculiar- 
ities of the training data — i.e., to generalize better. Many heuristic approaches 
[8,12,13] as well as more analytical ones [14,15] to pruning have been proposed. 
A special class of pruning algorithms are the on-line ones [16,17]. Even these 
algorithms work by the two-phase approach: An initial decision tree is fitted to 
the data and its prunings are then used as experts that collectively predict the 
class of observed instances. 

Reduced error pruning was originally proposed by Quinlan [8]. It has been 
used rather rarely in practical learning algorithms mainly because it requires 
part of the available data to be reserved solely for pruning purposes. However,
empirical evaluations of pruning algorithms indicate that the performance of Rep 
is comparable to other more widely used pruning strategies [12,13]. In analyses 
Rep has often been considered a representative pruning algorithm [13,18]. It 
produces an optimal pruning of a given tree — the smallest tree among those 
with minimal error (with respect to the set of pruning examples) [13,19]. 

Table 1 presents the Rep algorithm in pseudo code (for simplicity only for 
decision trees with binary splits). It works in two phases: First the set of pruning 
examples S is classified using the given tree T to be pruned. The node statistics 
are updated simultaneously. In the second phase — a bottom-up pruning phase — 
those parts of the tree that can be removed without increasing the error of the 
remaining hypothesis are pruned away. The pruning decisions are based on the 
node statistics calculated in the top-down classification phase. 

The scarceness of (expensive) data used to be considered a major problem 
facing inductive algorithms. Therefore, Rep's requirement of a separate pruning
set of examples has been seen as prohibitive. Nowadays the situation has turned
around: In data mining abundance of data is considered to be a major problem 
for learning algorithms to cope with. Thus, it should not be a major obstacle to 
leave some part of the data aside from the decision tree building phase and to 
reserve it for pruning purposes. 

Rep is an ERM algorithm for the hypothesis class consisting of all prunings 
of a given decision tree (for a proof, see [19]). Thus, it can be used to efficiently 
compute Rademacher penalties and, hence, also generalization error bounds for 
the class of prunings of a decision tree. This leads us to the following strategy. 
First, we use a standard heuristic decision tree induction algorithm (C4.5) to 
grow a decision tree based on a set of training examples. The tree serves as a 
representation of the data-dependent hypothesis class that consists of its prunings. 
As C4.5 usually performs quite well on real-world domains, it is reasonable 
to assume (even though it cannot be proved) that the class of prunings 
contains some good hypotheses. 

Table 1. The Rep algorithm capable of handling complement labels also. The algorithm 
first classifies the pruning examples in a top-down pass using method classify 
and then, during a bottom-up pass, prunes the tree using method prune 

decTree REP( decTree T, exArray S )      // Prune the tree 
    for( i = 0 to S.length-1 ) classify( T, S[i] ); 
    prune( T ); return T; 

void classify( decTree T, example e )    // Update node counters top-down 
    T.total++; T.count[e.label]++; 
    if( !leaf(T) ) 
        if( T.test(e) == 0 ) classify( T.left, e ); 
        else classify( T.right, e ); 

int error( label y, cntArray count )     // Compute classification error 
    int errors = 0; 
    foreach( z in Y \ {y} ) errors += count[z]; 
    return errors + count[bar(y)]; 

int prune( decTree T )                   // Output classification error after pruning 
    int leafError = error( T.label, T.count ); 
    if( leaf(T) ) return leafError; 
    int treeError = prune( T.left ) + prune( T.right ); 
    if( treeError < leafError ) return treeError; 
    else replace T with a leaf labeled T.label; 
         return leafError; 
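
The two passes of Table 1 can be sketched in runnable form. The following Python version is only an illustration under simplifying assumptions: it handles ordinary class labels (the complement labels of Table 1 are left out), and the Node class with its field names is hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    """Binary decision tree node (hypothetical representation)."""
    label: str                                     # label assigned when the tree was grown
    test: Optional[Callable[[dict], int]] = None   # 0 -> go left, anything else -> go right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    total: int = 0                                 # pruning examples that reached this node
    count: Dict[str, int] = field(default_factory=dict)  # per-label counts at this node

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def classify(node: Node, x: dict, y: str) -> None:
    """Top-down pass: update the counters on the path followed by (x, y)."""
    node.total += 1
    node.count[y] = node.count.get(y, 0) + 1
    if not node.is_leaf():
        child = node.left if node.test(x) == 0 else node.right
        classify(child, x, y)


def leaf_error(node: Node) -> int:
    """Pruning-set errors made if this node were a leaf predicting node.label."""
    return node.total - node.count.get(node.label, 0)


def prune(node: Node) -> int:
    """Bottom-up pass: return the pruning-set error of the pruned subtree."""
    err_leaf = leaf_error(node)
    if node.is_leaf():
        return err_leaf
    err_tree = prune(node.left) + prune(node.right)
    if err_tree < err_leaf:
        return err_tree
    # Ties are resolved in favor of the leaf, which yields the smaller pruning.
    node.left = node.right = None
    node.test = None
    return err_leaf


def rep(tree: Node, pruning_set) -> Node:
    """Reduced error pruning of `tree` on (example, label) pairs."""
    for x, y in pruning_set:
        classify(tree, x, y)
    prune(tree)
    return tree
```

Each pruning example is routed down one root-to-leaf path and each node is visited once in the bottom-up pass, which is where the efficiency of the procedure comes from.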

Having grown a decision tree, we use a separate pruning data set to select 
one of the prunings of the grown tree as our final hypothesis. In this paper, we 
use Rep as our pruning algorithm, but in principle any other pruning algorithm 
using the same basic pruning operation could be used as well. However, since 
Rep is an empirical risk minimization algorithm, the derived error bounds will 
be the tightest when combined with it. 

Our view on pruning is similar to that of Esposito et al. [20], who viewed 
many decision tree pruning algorithms as instantiations of search in the state 
space consisting of all prunings of a given decision tree, the state transition 
function being determined by the basic pruning operation. In this setting, Rep 
can be seen as a search algorithm whose bias is determined by the ERM principle 
and the tendency to favor small hypotheses. Our goal, however, is not to analyze 
the search itself, but to evaluate the goodness of the final pruning produced by 
the search algorithm. We pursue this goal further in Section 5. 

One shortcoming of the two-phase decision tree induction approach is that 
there does not exist any well-founded approach for deciding how much data to 
use for the training and pruning phases. Only heuristic data set partitioning 
schemes are available. However, the simple rule of using, e.g., two thirds of the 
data for training and the rest for pruning has been observed to work well in 
practice [13]. If the initial data set is very large, it may be computationally 
infeasible to use all the data for training or pruning. In that case one can use 
heuristic sequential sampling methods for selecting the size of the training set 
and determine the size of the pruning set, e.g., by using progressive Rademacher 
sampling [4]. Because Rep is an efficient linear-time algorithm, it is not hit hard 
by overestimated pruning sample size. 



4 Related Pruning Algorithms 

Rep produces the smallest of the most accurate prunings of a given decision tree, 
where accuracy is measured with respect to the pruning set. Other approaches 
for producing optimal prunings for different optimality criteria have also been 
proposed [21,22,23,24]. However, often optimality is measured over the training 
set. Then it is only possible to maintain the initial training set accuracy, assuming 
that no noise is present. Neither is it usually possible to reduce the size of 
the decision tree without sacrificing the classification accuracy. For example, 
Bohanec and Bratko [22] as well as Almuallim [24] have studied how to efficiently 
find the smallest pruning that satisfies a given minimum accuracy requirement. 

The strategy of using one data set for growing a decision tree and another for 
pruning it closely resembles the on-line pruning setting [16,17]. In it the prun- 
ings of the initial decision tree are viewed as a pool of experts. Thus, pruning 
is performed on-line, while giving predictions to new examples, rather than in a 
separate pruning phase. The main advantage of the on-line methods is that no 
statistical assumptions about the data generating process are needed and still 
the combined prediction and pruning strategy can be proven to be competitive 
with the best possible pruning of the initial tree. These approaches do not choose 
or maintain one pruning of the given decision tree, but rather a weighted com- 
bination of prunings which may be impossible to interpret by human experts. 
The loss bounds are meaningful only for very large data sets and there exists no 
empirical evaluation of the performance of the on-line pruning methods. 

The pruning algorithms of Mansour [14] and Kearns and Mansour [15] are 
very similar to Rep in spirit. The main difference between these pruning algorithms 
and Rep is the fact that they do not require the sample S on which pruning 
is based to be independent of the tree T; i.e., T may well have been grown 
based on S. Moreover, the pruning criterion in both methods is a kind of a cost- 
complexity condition [21] that takes both the observed classification error and 
(sub)tree complexity into account. Both algorithms are pessimistic. They try 
to bound the true error of a (sub)tree by its training error. Since the training 
error is by nature optimistic, the pruning criterion has to compensate it by being 
pessimistic about the error approximation. 

Both Mansour [14] and Kearns and Mansour [15] provide generalization error 
analyses for their algorithms. The bound presented in [14] measures the com- 
plexity of the class of prunings by the size of the unpruned tree. If this size or an 
upper bound for it is known in advance, the bound applies also when the pruning 
data is not independent of the tree to be pruned. Mansour’s bound can be used 
in connection with Rep, too, and we will use it as a point of comparison for our 
generalization error bounds in Section 6. Kearns and Mansour [15] prove that 
the generalization error of the pruning produced by their algorithm is bounded 
by that of the best pruning of the given tree plus a complexity penalty. However, 
the penalty term can grow intolerably large and cannot be evaluated because of 
its dependence on the unknown optimal pruning and hidden constants. 



5 Combining Rademacher Penalization 
and Decision Tree Pruning 

When using Rep, the data sets used in growing the tree and pruning it are 
independent of each other. Therefore, any standard generalization error analysis 
technique can be applied to the pruning found by Rep as if the hypothesis class 
from which Rep selects a pruning was fixed in advance. A formal argument 
justifying this would be to carry out the generalization error analysis conditioned 
on the training data and then to argue that the bounds hold unconditionally by 
taking expectations over the selection of the training data set. 

By the above argument, the theory of Rademacher penalization can be ap- 
plied to the data-dependent class of prunings. Therefore, we can use the results 
presented in Section 2 to provide generalization error bounds for prunings found 
by Rep (or any other pruning algorithm). Moreover, since Rep is a linear-time 
ERM algorithm for the class of prunings, it can be used to evaluate the gener- 
alization error bounds efficiently. 

To summarize, we propose the following decision tree learning strategy that 
provides a generalization error bound for the hypothesis it produces: 

— Split the available data into a growing set and a pruning set. 

— Use, e.g., C4.5 (without pruning) on the growing set to induce a decision 
tree. 

— Find the smallest most accurate pruning of the tree built in the previous 
step using Rep (or any other pruning algorithm) on the pruning set. This is 
the final hypothesis. 

— Evaluate the error bound as explained in Section 2 by running Rep two more 
times. 

Even though the tree growing process is heuristic, the generalization error 
bounds for the prunings are provably true under the i.i.d. assumption. They are 
valid even if the tree growing heuristic fails, that is, when none of the prunings of 
the grown tree generalize well. In that case the bounds are, of course, unavoidably 
large. The situation is similar to, e.g., margin-based generalization error analysis, 
where the error bounds are good provided that the training data generating 
distribution is such that a hypothesis with a good margin distribution can be 
found. In our case the error bounds are tight whenever C4.5 works well for the 
data-generating distribution in question. The empirical evidence overwhelmingly 
demonstrates that C4.5 usually fares quite well. 

Generalization error bounds can be roughly divided into two categories: 
Those based on a training set only and those requiring a separate test set [25]. 
Our generalization error bounds for prunings may be seen to lie somewhere between 
these two extremes. We use only part of the data in the tree growing 
phase. The rest — the set of pruning data — is used for selecting a pruning and 
evaluating the generalization error bound. Thus, some of the information con- 
tained in the pruning set may be lost as it cannot be used in the tree induction 
phase. However, the pruning set is still used for the non-trivial task of selecting 
a good pruning, so that some of the information contained in it can be exploited 
in the final hypothesis. The pruning set is thus used as a test set for the outcome 
of the tree growing phase and also as a proper learning set in the pruning phase. 

6 Empirical Evaluation 

The obvious performance reference for the approach of Rademacher penaliza- 
tion over decision tree prunings is to compare it to existing generalization error 
bounds such as the ones presented by Mansour [14] and Kearns and Mansour [15]. 
The bound in the latter is impossible to evaluate in practice because it requires 
knowing the depth and size of the pruning with the best generalization error. 
This leaves us with the bound of Mansour which only requires knowing the 
maximum size of prunings in advance. Bounds developed in the on-line pruning 
setting [16] are incomparable with the one presented in this paper because of 
the different learning model. Thus, they will not be considered here. 

Mansour [14] derived, based on the Chernoff bound, the following bound for 
the generalization error of a decision tree h with k nodes: 



where d is the arity of binary example vectors x_i and c is a constant. The bound 
applies only to binary decision trees in the two-class setting. When used for 
the class of unrestricted multi-class decision trees, the bound will give an overly 
optimistic estimate of what could be obtained with Mansour’s proof technique 
in this more general setting. For the value of c we use a crude underestimate 0.5. 
Both these choices are in favor of Mansour’s bound in the comparison. 

The error bound based on Rademacher penalization depends on the data 
distribution so that its true performance can be evaluated only empirically. As 
benchmark data sets we use six large data sets from the UCI Machine Learning 
Repository, namely the Census income (2 classes), Connect (3 classes), Covertype 
(7 classes), and generated Led datasets (10 classes) with 5, 10, and 15 percent 
attribute noise and 300,000 instances. In each experiment we allocate 10 percent 
of the data for testing and split the rest to growing and pruning sets. As the 
split ratio we chose 2:1 as suggested by Esposito et al. [13]. 
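
The data partitioning described above can be sketched as follows. This is a minimal illustration rather than the authors' experimental code; the function name, the use of a seeded shuffle, and the rounding of set sizes are assumptions.

```python
import random


def split_data(examples, test_frac=0.10, grow_to_prune=(2, 1), seed=0):
    """Return (growing, pruning, test) lists.

    test_frac of the data is held out for testing; the remainder is divided
    between the growing and pruning sets in the ratio grow_to_prune.
    """
    rng = random.Random(seed)          # seeded so that a split can be reproduced
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    g, p = grow_to_prune
    n_grow = round(len(rest) * g / (g + p))
    return rest[:n_grow], rest[n_grow:], test
```

Calling the function with ten different seeds would give ten random splits of the kind averaged over in the experiments.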
Table 2. Averages and standard deviations of sizes of trees grown by C4.5 (left) and 
error bounds for Rep (right) over 10 random splits of the data sets 

Data set    Unpruned      Default       Rep           Test set    R-bound     M-bound 
Census      19732 ±732    1377 ±268     4749 ±397     4.9 ±0.1    8.7 ±0.2    49.9 ±0.9 
Connect     10973 ±361    4253 ±104     4338 ±235     20.7 ±0.8   32.4 ±0.4   89.3 ±1.5 
Cover       25356 ±221    22095 ±228    17404 ±179    6.9 ±0.1    12.7 ±0.1   44.0 ±0.2 
Led24-5     27357 ±139    7042 ±74      3850 ±233     13.4 ±0.2   19.7 ±0.2   61.3 ±0.2 
Led24-10    51790 ±204    13624 ±220    7671 ±323     26.4 ±0.1   36.8 ±0.2   91.7 ±0.2 
Led24-15    71162 ±156    20273 ±259    11344 ±265    38.6 ±0.2   52.2 ±0.2   114.6 ±0.2 


Table 2 summarizes the results of our experiments averaged over ten ran- 
dom splits of the data sets. Observe that the unpruned decision trees are very 
large, which means that the class of prunings may potentially be very complex. 
The results indicate that the default pruning of C4.5 and Rep both manage to 
decrease the tree sizes considerably. 

The right-hand side of Table 2 presents the test set accuracies and error 
bounds for Rep prunings based on Rademacher penalization and Mansour’s 
method. In both bounds, we set δ = 0.01. Even though the bounds based on 
Rademacher penalization clearly overshoot the test set accuracies, they still pro- 
vide reasonable estimates in many cases. Note that in the multi-class settings 
even error bounds above 50 percent are non-trivial. The Rademacher bounds 
are clearly superior to even the underestimates of the bounds by Mansour that 
we used as a benchmark. The amount by which the Rademacher bound over- 
estimates the test set error is seen to be almost an order of magnitude smaller 
than the corresponding quantity related to Mansour’s bound. 
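
The comparison in the last sentence can be checked directly from the mean values in Table 2. The short script below is only a worked calculation: it reads the "Test set" column as an error percentage and ignores the standard deviations; the variable and function names are illustrative.

```python
# Mean values in percent, copied from Table 2: (test set error, R-bound, M-bound).
table2 = {
    "Census":   (4.9, 8.7, 49.9),
    "Connect":  (20.7, 32.4, 89.3),
    "Cover":    (6.9, 12.7, 44.0),
    "Led24-5":  (13.4, 19.7, 61.3),
    "Led24-10": (26.4, 36.8, 91.7),
    "Led24-15": (38.6, 52.2, 114.6),
}


def overestimates(test_error, r_bound, m_bound):
    """Amount by which each bound exceeds the observed test set error."""
    return r_bound - test_error, m_bound - test_error


for name, row in table2.items():
    r_gap, m_gap = overestimates(*row)
    print(f"{name:9s} R-gap {r_gap:5.1f}  M-gap {m_gap:5.1f}  ratio {m_gap / r_gap:4.1f}")
```

On these means, the overestimate of Mansour's bound is between roughly 5.6 and 11.8 times that of the Rademacher bound, depending on the data set.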

7 Conclusion 

Modern generalization error bounding techniques that take the observed data 
distribution into account give far more realistic sample complexities and gener- 
alization error approximations than the distribution independent methods. We 
have shown how one of these techniques, namely Rademacher penalization, can 
be applied to bound the generalization error of decision tree prunings, also in 
the multi-class setting. According to our empirical experiments the proposed 
theoretical bounds are significantly tighter than previous generalization error 
bounds for decision tree prunings. However, the new bounds still appear unable 
to faithfully describe the performance attained in practice. 
Abstract. In this paper we show how to learn rules to improve the performance 
of a machine translation system. Given a system consisting of two translation 
functions (one from language A to language B and one from B to A), training 
text is translated from A to B and back again to A. Using these two translations, 
differences in knowledge between the two translation functions are identified, 
and rules are learned to improve the functions. Context-independent rules are 
learned where the information suggests only a single possible translation for a 
word. When there are multiple alternate translations for a word, a likelihood ra- 
tio test is used to identify words that co-occur with each case significantly. 
These words are then used as context in context-dependent rules. Applied on 
the Pan American Health Organization corpus of 20,084 sentences, the learned 
rules improve the understandability of the translation produced by the SDL In- 
ternational engine on 78% of sentences, with high precision. 



1 Introduction 

Machine translation systems are now commonplace. For example, they can be 
found for free on a number of web sites. If we treat these systems as black box 
translation engines where text is input and the translation obtained, can we im- 
prove the translation performance automatically? 

Most previous research in machine translation has focused on developing sys- 
tems from the ground up. Modern systems generally employ statistical and/or 
learning methods ([Melamed, 2001] and [Yamada and Knight, 2001]). A number 
of translation systems are offered commercially not only to businesses, but also 
to anyone with web access ([FreeTranslation, 2002] and [Systran, 2002]). These 
systems are either stand-alone translation engines or integrated into a general 
information processing system ([Damianos et al., 2002]). Although these systems 
typically do not employ state of the art translation methods, they are widely used. 
In this paper, we examine these publicly available systems. The methods we de- 
scribe work well on this type of system, but can also be employed on other ma- 
chine translation systems. 
The most common machine translation systems do word level translation. Although 
word level methods are the simplest, they have proved surprisingly successful. In an 
investigatory study, [Koehn and Knight, 2001] show that 90% of the words in 
a corpus can be translated using a straightforward word for word translation. In this 
paper, we examine learning word level correction rules to improve machine transla- 
tion systems. Rule learning approaches have proved successful in other natural lan- 
guage problems because they leverage statistical techniques and also tend to produce 
understandable and interpretable rules ([Brill, 1995]). 

Most machine translation systems can translate in both directions between a lan- 
guage pair. Such a system can be thought of as two different functions, one that trans- 
lates in one direction and a second that translates in the opposite direction. These 
functions are usually developed semi-independently and often the lexicon used by 
each is independent. This results in a difference in the knowledge built into each 
function. In this paper, we propose a method for automatically detecting this knowl- 
edge discrepancy and, using this information, for improving the translation functions. 
Given a word list in language A, we translate those words to language B and back 
again to language A. In some sense, the original word list defines a ground truth for 
the final set of translated words. Deviations from this ground truth point to cases 
where the system can be improved. 

Using this setup, we describe how rules can be learned to improve these translation 
functions. Context-independent rules are learned where there is no ambiguity about 
the translation of a word. For words with multiple possible translations, a corpus is 
used to identify candidate context words and the likelihood ratio test is used to iden- 
tify which of these context words co-occur significantly. Using these significant 
words, context-dependent rules are learned that disambiguate between ambiguous 
cases. 

Using our method, 7,971 context-independent rules and 1,444 context-dependent 
rules are learned. These rules improve the understandability of the translation of 
24,235 words and 78% of the sentences in the Pan American Health Organization 
corpus of over half a million words and 20,084 sentences. 



2 Setup and Terminology 

Before we explain the method for improving machine translation systems, we 
first define some terminology and assumptions. A machine translation system is a 
pair of translation functions (f, f') where L_1 and L_2 are natural languages and 
where f translates from L_1 to L_2 and vice versa for f'. We assume that we have 
unlimited access to the translation functions of a machine translation system, but 
not to the details of how the functions operate. We also assume that we have a 
large amount of text available in the languages that the machine translation sys- 
tem translates between. Finally, instead of trying to learn correction rules that 
change an entire sentence at once, we only learn rules that change a single word 
at a time. 

In many situations, doing multiple single word changes leads to results similar 
to full sentence correction. Solving the single-word correction problem involves 
three different steps. The first step is to identify where a word is being translated 
incorrectly. Given this incorrectly translated word, the second step is to identify 
the correct translation for that word. Finally, given an incorrect translation and 
the appropriate correct translation, the third step is to generate rules capable of 
making corrections in new sentences. 

The first two steps can be seen as data generation steps. These steps generate 
examples that can then be used to generate rules in the third step using some 
supervised learning method. The three steps can be written as follows: 

1. Find mistakes: Find word s_i in sentence s ∈ L_1 with input context C_1(s_i) 
   where s_i is translated incorrectly to t_i with output context C_2(t_i). 

2. Find corrections: Find the correct translation, r_i, for s_i in s ∈ L_1 with input 
   context C_1(s_i), output context C_2(t_i) and incorrect translation t_i. 

3. Learn correction rules: Generate a correction function g such that 
   g(s_i, C_1(s_i), t_i, C_2(t_i)) = r_i for each data sample i. A rule fires when s_i is in the 
   input sentence with context C_1(s_i) and s_i is translated to t_i with context 
   C_2(t_i) by the original machine translation system. Firing changes t_i to r_i. 

The contexts described above can be any representation of the context of a word 
in a corpus, but we will use the bag of words of the sentence containing the word. 
Although this loses positional information, it is a simple representation that 
minimizes the parameters required for learning. Our goal is to improve a machine 
translation system, given the assumptions stated above, by solving each of the 
three problems described. The key to our approach is that given a sentence s in a 
language, we can learn information from f(s) and f'(f(s)). 
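
To make the role of the correction function g concrete, the sketch below represents a single rule and shows when it fires. It is an illustration under stated assumptions: the Rule class and its field names are hypothetical, sentences are given as token lists, and a bag-of-words context is taken to match when all of its words occur in the sentence. The example rule uses the word pair scroll / rollo from Table 1.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    """One correction rule g(s, C_1(s), t, C_2(t)) = r (hypothetical layout)."""
    input_word: str                  # s: word in the input sentence
    input_context: FrozenSet[str]    # C_1(s): bag of words required in the input
    current: str                     # t: translation produced by the system
    output_context: FrozenSet[str]   # C_2(t): bag of words required in the output
    correct: str                     # r: translation that t should be changed to

    def fires(self, src: List[str], out: List[str]) -> bool:
        src_bag, out_bag = set(src), set(out)
        return (self.input_word in src_bag
                and self.input_context <= src_bag
                and self.current in out_bag
                and self.output_context <= out_bag)


def apply_rules(rules, src: List[str], out: List[str]) -> List[str]:
    """Apply every firing rule to the system output, one word change at a time."""
    result = list(out)
    for rule in rules:
        if rule.fires(src, result):
            result = [rule.correct if tok == rule.current else tok for tok in result]
    return result


# A context-independent rule has empty contexts on both sides.
rule = Rule("rollo", frozenset(), "rollo", frozenset(), "scroll")
corrected = apply_rules([rule], ["el", "rollo"], ["the", "rollo"])
```

Because the context is a bag of words, the rule carries no positional information, in line with the representation chosen in the text.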



3 Analysis of Cases for an Example MT System 

We examine one particular system and the application of the ideas above to im- 
prove this system. There are a number of commercial systems publicly available 
including [Systran, 2002] and [FreeTranslation, 2002]. Although Systran’s sys- 
tem is more widely used, FreeTranslation offers more relaxed requirements on 
the length of the text to be translated. Also, initial comparison showed that results 
on AltaVista, which uses Systran’s translation software, were comparable to the 
results obtained from FreeTranslation. Given a machine translation system, 
(f, f'), we calculate translations f(w) and f'(f(w)) for a set of words w in L_1. 
In our case, we choose L_1 = English and L_2 = Spanish. 

Table 1 shows a summary of the data generated using FreeTranslation.com in 
February 2003 from 45,192 English words ([Red Hat, 2002]). A partition (i.e. 
non-overlapping, exhaustive set) of the possible outcomes is shown. We examine 
each of these cases and explain how each case provides information for improving 
the translation system (f, f'). For many machine translation systems, the 
default when the translation for w is unknown is to translate the word as w (i.e. 
f(w) = w). Throughout these examples, we will assume that equality implies that 
the system could not translate the word. A message or flag issued by the system 
could be used instead, if available. 

w = f'(f(w)) ≠ f(w): 

In this case, the word w is translated to a different string, f(w), in the second 
language. When f(w) is translated back to the original language, it is translated 
back to the original word w. Generally, in this situation the machine translation 
system is translating these words correctly. Mistakes can still occur here if there 
are complementary mistakes in the lexicon in each direction. There is no informa- 
tion for the three problems described above in this case. 



Table 1. The results from doing a three-way translation of approximately 45,192 English 
words to Spanish and back to English. 

Case                     Occurrences    Example w, f(w), f'(f(w)) 
w = f'(f(w)) ≠ f(w)      9,330          dog, perro, dog 
w = f(w) ≠ f'(f(w))      278            metro, metro, meter 
f(w) = f'(f(w))          8,785          scroll, rollo, rollo 
w = f(w) = f'(f(w))      11,586         abstractness, abstractness, abstractness 
w ≠ f(w) ≠ f'(f(w))      14,523         cupful, taza, cup 



w = f(w) ≠ f'(f(w)): 

In this case, the word w is translated to the same string in the second language; 
however, it is then translated to a different string when it is translated back to the 
original language. This happens when w is a word in both languages (possibly 
with different meanings), which the translation system is unable to translate to the 
second language (for example, w = arena, f(w) = arena, f'(f(w)) = sand). 
From these examples, we learn that the translation function f should translate 
f'(f(w)) to f(w) (Problem 2). This information may or may not be useful. We 
can query f to see if this information is already known. 

f(w) = f'(f(w)): 

In this case, the word w is translated from the original language to the second 
language; however, it is then translated as the same word when translated back to 
the original language. There are two cases where this happens. 

1. The most likely situation is that there is a problem with the translation system 
   from the second language to the original language (i.e. in f') since the default 
   behavior for translating an unknown word is to leave the word untranslated. 
   In this case, two pieces of information are learned. First, if f(w) is 
   seen on the input and is translated to f'(f(w)) then a mistake has occurred 
   (Problem 1). We can also suggest the correct translation. Given a sentence 
   s, if word s_i is translated to t_i and t_i = f'(f(w)), then s_i was incorrectly 
   translated and the correct translation for s_i is w (Problem 2). 

2. The second case, which is less likely, is that f(w) is a word that, when 
   translated back to the original language, is the same string (this is similar to 
   case 2 below of w = f(w) = f'(f(w))). For example, w = abase, f(w) = degrade 
   (present subjunctive form of degradar, to degrade), f'(f(w)) = degrade. 
   We can learn that f(w) is an ambiguous word that can be translated 
   as either w or f'(f(w)). 
w = f(w) = f'(f(w)): 

In this case, all the words are the same. There are two common situations. 

1. If the word for w in the second language is actually w then the translation is 
   correct. This is common with proper names (for example, w = Madrid, 
   f(w) = Madrid, f'(f(w)) = Madrid). In this case, no information is gained 
   to solve the problems listed above. 

2. If the system is unable to translate w, then w = f(w). If this is the case, then 
   it is unlikely that w will actually be a valid word in the second language (as 
   shown above, this does happen 278 out of 45,192 times, where f(w) is 
   translated to something different by f') and so the word again gets translated 
   as w in the second translation step (for example, w = matriarchal, f(w) = 
   matriarchal, f'(f(w)) = matriarchal). In this case, the translation function f 
   makes a mistake on word w (Problem 1). 

f(w) ≠ f'(f(w)) ≠ w: 

There are two situations that may cause this to happen: w may be a synonym for 
f'(f(w)) or there may be at least one error in the translation. If we assume that 
the knowledge in the translation systems is accurate, then both w and f'(f(w)) 
are appropriate translations for f(w). These two cases can be disambiguated 
using contextual information. 

One last piece of information can be obtained when f(w) ≠ f'(f(w)). In these 
cases, some translation was done by f'. We can assume that f'(f(w)) actually 
is a word in the original language. Using this, we can extend the word list in the 
original language. 
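
The five cases analyzed in this section can be assigned mechanically from a word and its two translations. The sketch below is illustrative only: the lexicons are toy dictionaries, the function names are hypothetical, and an unknown word is returned unchanged, which is the default behavior assumed throughout this section.

```python
from collections import Counter


def translate(lexicon, word):
    """Look up a word; unknown words are left untranslated (assumed default)."""
    return lexicon.get(word, word)


def round_trip_case(w, fw, ffw):
    """Name the case of Table 1 that the triple (w, f(w), f'(f(w))) falls into."""
    if w == fw == ffw:
        return "w = f(w) = f'(f(w))"
    if w == ffw:
        return "w = f'(f(w)) != f(w)"
    if w == fw:
        return "w = f(w) != f'(f(w))"
    if fw == ffw:
        return "f(w) = f'(f(w))"
    return "f(w) != f'(f(w)) != w"


def tally_cases(words, forward, backward):
    """Count how many words of a word list fall into each case."""
    counts = Counter()
    for w in words:
        fw = translate(forward, w)
        ffw = translate(backward, fw)
        counts[round_trip_case(w, fw, ffw)] += 1
    return counts


# Toy lexicons built from the example column of Table 1.
en_to_es = {"dog": "perro", "scroll": "rollo", "cupful": "taza"}
es_to_en = {"perro": "dog", "metro": "meter", "taza": "cup"}
counts = tally_cases(["dog", "metro", "scroll", "abstractness", "cupful"],
                     en_to_es, es_to_en)
```

The conditions are tested in an order that makes the cases mutually exclusive, so every word is counted exactly once.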



4 Rule Learning 

Using the framework described in Section 3, we can learn rules that improve the 
output of a translation system. We learn two different types of rules: context- 
independent and context-dependent. If there is no ambiguity about the translation 
of a word, then context is not required to disambiguate and a context-independent 
rule can be learned. If, on the other hand, there are multiple possible translations, 
then context is required to decide between the different possible translations. 
Figure 1 outlines the algorithm for generating the data and for learning both types 
of rules. 

For preprocessing, the word lists are translated from the starting language to the 
alternate language and back to the original language. Table 2 summarizes the 
information that is used for generating rules from these translations. The input 
words are L_2 words. The current translations are the words expected to be seen in 
the output of the translation system. Finally, the correct translations indicate 
which word the output word should be changed to. 

By examining the input words involved in the cases in Table 2, non-ambiguous 
words can be identified where an input word only has one learned correct translation. 
Notice that many of the entries in Table 2 are inherently ambiguous, such as when 
f(w) ≠ f'(f(w)). Almost all non-ambiguous words are generated from the case 
when f(w) = f'(f(w)), where the system knows how to translate f(w) from 
English but does not know how to translate it back to English. 
Preprocessing steps 

- Translate L_1 word list from L_1 to L_2 and back to L_1 
- Translate L_2 word list from L_2 to L_1 and back to L_2 
- Generate input word (L_2), current translation (L_1) and correct translation (L_2) triplets using 
  rules in Table 2 
- For all words, w, in corpus, generate frequency counts, count(w) 
- Translate corpus from L_1 to L_2 to use for learning contexts 

Generate context-independent rules for non-ambiguous words 

- Identify non-ambiguous words by finding all “input words” with only a single suggested 
  correct translation 
- Generate context-independent rules of the form: 
  g(input word, [], current translation, []) → correct translation 

Generate context-independent rules for k-dominant words 

- Find sets of “input words” that have the same suggested correct translation. These words 
  represent possible translation options. Identify k-dominant words where 
  count(option_i) ≥ k and count(option_j) = 0 for all j ≠ i 
- Generate context-independent rules of the form: 
  g(input word, [], current translation, []) → option_i 

Generate context-dependent rules for ambiguous words 

- Get the possible context words t_j for each option_i for the remaining ambiguous words 
  - In the L_1 corpus, find sentences where option_i appears and the corresponding ambiguous 
    word is in the translated sentence in L_2 
  - Get all possible context words t_j as the words surrounding option_i 
- For each option_i, generate the context, c(option_i), as all t_j that pass the significance level α 
  threshold for the likelihood ratio test 
- Learn context-dependent rules of the form: 
  g(input word, [], current translation, c(option_i)) → option_i 

Fig. 1. Outline of algorithm to learn rules to improve L_2 to L_1 translation. The preprocessing 
steps generate the initial data for use in learning the rules. The following three sets of steps 
describe the algorithms for learning the context-independent and context-dependent rules. 



Table 2. Patterns for generating rules for Spanish to English improvement. 

Case                            Input word    Current translation    Correct translation 

Eng Sp Eng                      f(w)          f'(f(w))               w 
  f(w) = f'(f(w)) 
  f(w) is not an English word 

Eng Sp Eng                      f(w)          f'(f(w))               w 
  w ≠ f(w) = f'(f(w))           f(w)          f'(f(w))               f'(f(w)) 
  f(w) is an English word 

Eng Sp Eng                      f(w)          f'(f(w))               w 
                                f(w)          f'(f(w))               f'(f(w)) 

Sp Eng Sp                       f'(f(w))      f(f'(f(w)))            f(w) 
  w = f(w) ≠ f'(f(w)) 
  f'(f(w)) = f(f'(f(w))) 

Sp Eng Sp                       f'(f(w))      f(f'(f(w)))            f(w) 
  w = f(w) ≠ f'(f(w))           f'(f(w))      f(f'(f(w)))            f(f'(f(w))) 
  f'(f(w)) ≠ f(f'(f(w))) 
For those words where there is only one known translation and therefore no 
ambiguity, a context-independent rule of the form g(s, [], t, []) = r can be learned, 
where s = input word, t = current translation and r = correct translation. Using 
this methodology, 7,155 context-independent rules are learned from the list of 
45,192 words and the FreeTranslation engine. 
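
The selection of non-ambiguous words can be sketched as a grouping step over the generated triplets. The function below is a minimal illustration; its name, the tuple layout and the toy triplets are assumptions made for the example.

```python
from collections import defaultdict


def context_independent_rules(triplets):
    """Split (input word, current translation, correct translation) triplets.

    Returns a pair (rules, ambiguous):
      rules     maps (input word, current translation) to its single correction,
                i.e. a rule g(s, [], t, []) = r;
      ambiguous maps (input word, current translation) to the sorted list of
                competing corrections, to be resolved by other means.
    """
    options = defaultdict(set)
    for input_word, current, correct in triplets:
        options[(input_word, current)].add(correct)

    rules, ambiguous = {}, {}
    for key, corrections in options.items():
        if len(corrections) == 1:
            rules[key] = next(iter(corrections))
        else:
            ambiguous[key] = sorted(corrections)
    return rules, ambiguous


# Toy triplets: "rollo" has one suggested correction, "taza" has two.
triplets = [
    ("rollo", "rollo", "scroll"),
    ("taza", "cup", "cupful"),
    ("taza", "cup", "cup"),
]
rules, ambiguous = context_independent_rules(triplets)
```

Entries that end up in the second dictionary are the ones that need either the k-dominance test or context to decide between the options.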



4.1 Dealing with Ambiguous Words 

For the remaining input word, current translation and correct translation triplets, 
there are at least two correct translations for the same input word. We must de- 
cide between these possible correct translations. We suggest two methods that 
both leverage a corpus in the target language, in this case English, to distinguish 
between translation options. 

We would like to identify as many non-ambiguous words in the data as
possible, since these rules are simpler. To do this, we can use the English corpus
available. For our purposes, we use the Pan American Health Organization
corpus ([PAHO, 2002]) that consists of over half a million words. Counting the
occurrences of the possible translations (i.e. correct translation entries) can give
some indication about which translation options are more likely. We define an
input word as being k-dominant if one translation option occurs at least k times in
the text and all other options do not appear at all. When a word is k-dominant, it
is reasonable to assume that the input word should always be translated as the
dominant option. We can learn a context-independent rule that states exactly
this. Using this method, all of the k-dominant words with k = 5 are learned,
resulting in an additional 816 context-independent rules.
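
The k-dominance check can be illustrated with a minimal sketch. The function name, the whitespace tokenisation and the toy corpus below are assumptions made for illustration; they are not taken from the original system.

```python
from collections import Counter

def dominant_option(options, corpus_tokens, k=5):
    """Return the k-dominant option, or None if there is none.

    An option is k-dominant when it occurs at least k times in the corpus
    and every other option does not occur at all.
    """
    counts = Counter(corpus_tokens)
    seen = [opt for opt in options if counts[opt] > 0]
    if len(seen) == 1 and counts[seen[0]] >= k:
        return seen[0]
    return None

# Toy corpus, made up for illustration only.
corpus = "the tar content of the tar sample showed tar and more tar plus tar".split()

print(dominant_option(["tar", "pitch"], corpus))  # "pitch" never occurs, "tar" occurs 5 times
print(dominant_option(["the", "tar"], corpus))    # both occur, so neither is dominant
```

A word for which `dominant_option` returns an option would receive a context-independent rule; all other ambiguous words would be left for the context-dependent procedure.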

For all the input words where there are multiple possible translations and no 
one option is k-dominant, context can be used to disambiguate between the possi- 
ble translations. The rules being learned have the possibility of both an input 
context and an output context. In practice only context in the input or output 
language is necessary. In our case, for Spanish to English improvement, English 
text is more readily available, so only the output contexts will be learned. 

Given an ambiguous input word, a, that has option_1, ..., option_n as possible
correct translations, the goal is to learn a context for each possible translation,
option_i, that disambiguates it from the other translations. We do this by gathering
words from the English corpus that occur in the same sentences as each of the
possible translation options option_1, ..., option_n. We can use the machine
translation system to verify that option_i actually gets translated to a (and
correspondingly that a gets translated to option_i) in that context.



4.2 Determining Significant Context Words 

The problem described above is the problem of collocation: finding words that
are strongly associated. Many methods have been proposed for discovering
collocations, such as frequency counts, mean and variance tests, the t-test, the χ² test and
the likelihood ratio test ([Manning and Schutze, 1999]). The likelihood ratio test has
been suggested as the most appropriate for this problem since it does not assume
a normal distribution like the t-test, nor does it make assumptions about the
minimum frequency counts like the χ² test ([Dunning, 1993]).

For this problem, we have two different sets of sentences that we are interested in:
the set $S_1$ of sentences that contain the translation option $t_i$, and the set $S_2$ of sentences
that don’t contain the translation option. The decision is, for each context word $w_j$ in
the sentences belonging to $S_1$, whether or not that word is significantly associated with
the translation option $t_i$.

The likelihood ratio test tests an alternate hypothesis against a null hypothesis. In
this case, the null hypothesis is that the two groups of sentences (sentences with and
without $t_i$) come from the same distribution with respect to the occurrence of $w_j$ in the
sentence. The alternate hypothesis is that the two groups of sentences are different
with respect to the occurrence of $w_j$. We will also impose the further constraint that $w_j$
must be more likely to occur in sentences of $S_1$.

For each set of sentences, the occurrence of $w_j$ can be modeled using the binomial
distribution. The assumption is that there is some probability that $w_j$ occurs in a
sentence. For both hypotheses, the likelihood equation is $L = p(S_1; \theta_1)\, p(S_2; \theta_2)$. For the null
hypothesis, the sentences come from the same distribution and therefore $\theta_1 = \theta_2 = \theta$. In
all these situations, the maximum likelihood estimate of the parameter is used (the
frequencies seen in the training data, in this case the English corpus). Using these
parameter estimations, the likelihood ratio can be calculated in a similar fashion to
[Dunning, 1993]. We compare this value with a significance level, $\alpha$, to make a
decision about whether the collocation is significant or not. We do this for all words in
sentences that contain $t_i$, then construct context-dependent rules that contain all words
that pass the significance test in the context. For our experiments, $\alpha = 0.001$ is an
appropriate significance level. Intuitively, this means that a candidate word that is not
actually associated with the translation option has about a one in a thousand
chance of being misclassified as significant.
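
The test can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the statistic $-2 \log \lambda$ is compared against a chi-square distribution with one degree of freedom, as in Dunning's formulation, and all function and parameter names are hypothetical.

```python
import math

def log_likelihood(k, n, p):
    """Binomial log-likelihood of k hits in n sentences (constant term dropped)."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def is_significant(k1, n1, k2, n2, alpha=0.001):
    """k1 of the n1 sentences in S_1 contain the context word; k2 of the n2 in S_2 do."""
    p1, p2 = k1 / n1, k2 / n2
    if p1 <= p2:
        # Extra constraint: the word must be more likely in S_1.
        return False
    pooled = (k1 + k2) / (n1 + n2)  # maximum likelihood estimate under the null
    null = log_likelihood(k1, n1, pooled) + log_likelihood(k2, n2, pooled)
    alt = log_likelihood(k1, n1, p1) + log_likelihood(k2, n2, p2)
    statistic = max(0.0, 2.0 * (alt - null))  # -2 log(lambda)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return p_value < alpha

print(is_significant(30, 100, 50, 10000))   # word is far more frequent in S_1
print(is_significant(6, 100, 500, 10000))   # rates of 6% and 5% are hard to tell apart
```

The binomial coefficients are omitted because they are identical under both hypotheses and cancel in the ratio.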

To improve the generality of the contexts learned, we perform the test on stemmed 
versions of the words and generate context-dependent rules using these stemmed 
words. The Porter stemmer ([Porter, 1980]) is used to stem the words. For the re- 
mainder of the paper, the results provided are for the stemmed versions. 

5 Results 

In this section, we examine the success of the learned rules in a real domain. We 
examine the Pan American Health Organization (PAHO) Conferences and Gen- 
eral Services Division parallel texts. This data set consists of 180 pairs of docu- 
ments in English and Spanish ([PAHO, 2002]). 

The 180 Spanish documents consist of 20,084 sentences, identified by periods 
(minus a number of Spanish abbreviations), and 616,109 words, identified by surrounding
white space. The sentences are translated using FreeTranslation.com to get
the initial translation. Then, the rules learned using the algorithms in Section 4 are 
applied to change the sentences. For the context-independent rules, the rule fires 
anytime the appropriate words are seen in the original sentence and translated sen- 
tence. The context-dependent rules add the additional restriction that the translated 
sentence must also contain one of the words in the learned output context of the rule 
to fire.

Spanish:

El contenido de alquitran en los cigarrillos de tabaco negro sin filtro es mayor que en los restantes 
tipos de cigarrillos y son aquellos precisamente los de mayor consumo en la poblacion, lo que 
aumenta la potencialidad del tabaquismo como factor de riesgo. 

Original translation: 

The content of alquitran in the black cigarettes of tobacco without filter is greater that in the 
remaining types of cigarettes and are those precise the of greater consumption in the population, 
what enlarges the potencialidad of the tabaquismo as factor of risk. 

Improved translation: 

The content of tar in the black cigarettes of tobacco without filter is greater that in the remaining 
types of cigarettes and are those precise the of greater consumption in the population, what enlarges 
the potencialidad of the tabaquismo as factor of risk. 

Fig. 2. Example of an improvement in translation. The first sentence is the original Spanish 
sentence to be translated. The second sentence is the translation made by FreeTranslation.com. 
The final sentence is the translation after a learned improvement rule has been applied. The 
change is in bold. 



Table 3. Summary of results for rules generated from a word list with 45,192 entries. 



Rule type                              Rules learned   Avg. # words in context   Rules used   Words changed

Context independent                    6,783           NA                        701          5,022
Context independent, dominant k = 5    809             NA                        191          4,768
Context dependent, α = .001            1,355           5                         301          12,416



Over 9,000 rules are learned. Table 3 shows the results from applying these 
rules to the sentences. The rules change 22,206 words in the PAHO data set and 
14,952 or 74% of the sentences. Figure 2 shows an example firing of a context- 
independent rule that changes “alquitran” to “tar”. 
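
The firing conditions described in this section can be sketched as below. The `Rule` class, the whitespace tokenisation and the second example rule are assumptions made for illustration (the stemming of context words is omitted); only the "alquitran" to "tar" rule comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    input_word: str                       # s: word in the original Spanish sentence
    current: str                          # t: word produced by the translation system
    correct: str                          # r: replacement for t
    out_context: frozenset = frozenset()  # empty for context-independent rules

def apply_rules(source, translation, rules):
    """Apply every rule whose firing conditions hold and return the new translation."""
    src_words = set(source.lower().split())
    out = translation.split()
    for rule in rules:
        out_words = {w.lower() for w in out}
        # Both the input word and the current translation must be present.
        if rule.input_word not in src_words or rule.current not in out_words:
            continue
        # Context-dependent rules also need one context word in the translation.
        if rule.out_context and not (rule.out_context & out_words):
            continue
        out = [rule.correct if w.lower() == rule.current else w for w in out]
    return " ".join(out)

rules = [Rule("alquitran", "alquitran", "tar")]
print(apply_rules("El contenido de alquitran", "The content of alquitran", rules))
```

A context-dependent rule is written the same way with a non-empty `out_context`, for example the hypothetical `Rule("otro", "other", "another", frozenset({"type"}))`.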



6 Using Extended Word Lists 

The methods in this paper are based on having a word list in language L1. In this
section, we present two methods for extending this word list. One of the
advantages of the rule learning method described above is that it is robust to erroneous
words in the word list. If the system does not recognize a word in the word list
then it will not get translated, as is the case where w = f(w) = f'(f(w)). No
learning is done in this case, so erroneous words are filtered out by the machine
translation system. Since a high quality word list is not required, the word lists
can be constructed in at least two different ways.

When translating w to f(w) and back to the original language as f'(f(w)), if
f(w) ≠ f'(f(w)) then some translation was done between f(w) and f'(f(w)).
Given the robustness of the learning system, we can assume that if the machine
translation system translates f(w) to f'(f(w)), then f'(f(w)) is a word in the original
language. Using this method, 419 additional words not in the original word list are
learned.

In many circumstances, translation systems are to be used in a specific domain 
(for example medicine, politics, public health, etc.). The PAHO data set men- 
tioned earlier contains documents in the public health domain. To improve the 
recall of the machine translation we can incorporate more rules that contain ter- 
minology that is relevant to this particular domain. We can do this by examining 
words in a corpus of a similar domain to add to the word list. In our case, since 
the PAHO data set contains the parallel text in English, we can use this text. The 
English version of the PAHO data set contains 5,215 new words that are not in 
the original word list. 

Table 4 shows the results of learning rules with the original 45,192 words plus 
the 419 learned words and the 5,215 domain specific words. The additional 
words add 468 new rules. Although these new rules only constitute a small frac- 
tion of the total rules (~5%) they account for over 8% of the changes. In particu- 
lar, the new, domain specific context-independent rules fire over four times more 
often than the rules learned from a generic word list. Because these additional 
rules are learned using domain specific words, they are much more likely to apply 
for translating text in that particular domain. With the addition of these new 
rules, 78% of the sentences are changed. 



Table 4. Summary of results for rules generated using a general word list with 45,000 entries 
plus 419 learned words and 5,215 domain specific words. 



Rule type                              Rules learned   Avg. # words in context   Rules used   Words changed

Context independent                    7,155           NA                        903          6,526
Context independent, dominant k = 5    816             NA                        200          5,038
Context dependent, α = .001            1,444           5                         327          12,671



7 Discussion 

In this paper, we have examined a technique for improving a machine translation 
system using only plain text. One of the advantages of this approach is that the 
resources required to learn the rules are easier to obtain than traditional ap- 
proaches that require aligned bitext ([Macklovitch and Hannan, 1996]). Also, our 
method makes no assumptions about the workings of the translation system. 

By translating words from the original language to the second language and
back to the original language, differences in information between the two
translation functions are isolated. Using this information, we show how correction rules
can be learned. Context-independent rules are learned when the system only
suggests a single possible translation. When there is ambiguity about what the
correct translation of a word is, the likelihood ratio is used to identify words that
co-occur significantly with each translation option.

Using these rules, almost 25,000 words are changed on a corpus of over half a
million words. On a sample of 600 changes, the context-independent rules have a
precision of 99% and the context-dependent rules have a precision of 79%. One
of the open questions for machine translation research is how to evaluate a
translation. A few automated methods have been suggested such as BLEU ([Papineni
et al., 2001]), which is based on n-gram occurrence in a reference text. Although
these methods have merit, for the particular rules learned by our system, an n- 
gram metric would almost always see improvement since changing a Spanish 
word in English text to an English word will generally be better. For this reason, 
we instead chose to evaluate the results by hand. 

A majority of the context-independent rules represent changes where the origi- 
nal system did not know any possible translation, so it is not surprising that the 
precision is high. The context-dependent rules have lower precision even though 
a significance level of 0.001 was used. The main reason for this lower precision 
is that the likelihood ratio can suggest collocations that are significant, but that 
are not useful for ambiguity resolution. This is most pronounced when the counts are
very small or when the ambiguous translation is common and the counts are 
therefore high. In these cases, common words such as “is”, “that”, “it”, “have”, 
etc. are identified as significant. 

Another problem is the particular rule representation chosen. The context- 
dependent rules define the context as a bag of words. Unfortunately, a bag of 
words does not model many relationships, such as word order, syntax or semantics,
which can be useful for discriminating significant collocations. For example,
when deciding between “another” and “other” in the sentence fragment “Another 
important coordination type...”, the location of “type” and the fact that it is sin- 
gular suggests “another” as the correct translation. 

One final problem is that stemming can cause undesired side effects in the con- 
texts learned. As seen in the sentence fragment above, plurality is important, 
particularly when deciding between two translations that only differ by plurality. 
Unfortunately, stemming, in attempting to improve generality, removes the plu- 
rality of a word. The combination of these problems leads to a lower precision 
for the context-dependent rules. Future research should be directed towards em- 
ploying alternate rule representations and alternate collocation techniques such as 
in [Krenn, 2000]. 

The techniques that we used in this paper are just the beginning of a wide 
range of improvements and applications that use existing machine translation 
systems as a resource. As new applications that use translation systems arise, 
particularly those in time and information critical fields, such as [Damianos et
al., 2002], the importance of accurate automated translation systems becomes
critical.

Abstract. Distributed heterogeneous search environments are an emerging
phenomenon in Web search, in which topic-specific search engines provide search
services, and metasearchers distribute users’ queries to only the most suitable
search engines. Previous research has explored the performance of such envi- 
ronments from the user’s perspective (e.g., improved quality of search results). 
We focus instead on performance from the search service provider’s point of 
view (e.g., income from queries processed vs. resources used to answer them).
We analyse a scenario in which individual search engines compete for queries 
by choosing which documents to index. We propose the COUGAR algorithm 
that specialised search engines can use to decide which documents to index on 
each particular topic. COUGAR is based on a game-theoretic analysis of het- 
erogeneous search environments, and uses reinforcement learning techniques to 
exploit the sub-optimal behaviour of its competitors. 



1 Introduction 

Heterogeneous search environments are a recent phenomenon in Web search. They can 
be viewed as a federation of independently controlled metasearchers and many spe- 
cialised search engines. Specialised search engines provide focused search services in 
a specific domain (e.g. a particular topic). Metasearchers help to process user queries 
effectively and efficiently by distributing them only to the most suitable search engines 
for each query. Compared to the traditional search engines like Google or AltaVista, 
specialised search engines (together) provide access to arguably much larger volumes 
of high-quality information resources, frequently called the “deep” or “invisible” Web. 

Previous work has mainly explored the performance of such heterogeneous search 
environments from the user’s perspective (e.g., improved quality of search results). Ex- 
amples include algorithms for search engine selection and result merging [1]. On the 
other hand, a provider of search services is more interested in the income from queries 
processed vs. resources used to answer them. To the best of our knowledge, little at- 
tention has been paid to performance optimisation of search engines from the service 
provider’s point of view. 

An important factor that affects performance of a specialised search engine in a 
heterogeneous search environment is competition with other independently controlled
search engines. When there are many search engines available, users want to send their
queries to the engine(s) that would provide the best possible results. Multiple search 
providers in a heterogeneous search environment can be viewed as participants in a 
search services market competing for user queries. 

We examine the problem of performance-maximising behaviour for non-cooper- 
ative specialised search engines in heterogeneous search environments. We analyse a 
scenario in which individual search engines compete for queries by choosing to index 
documents for which they think users are likely to query. Our goal is to propose a 
method that specialised search engines can use to select on which topic(s) to specialise 
and how many documents to index on that topic to maximise their performance. 

While the search engines in a heterogeneous search environment are independent in 
terms of selecting their content, they are not independent in terms of the performance 
achieved. Changes to parameters of one search engine affect the queries received by its 
competitors and, vice versa, actions of the competing engines influence the queries re- 
ceived by the given search engine. Thus, the utility of any local content change depends 
on the state and actions of other search engines in the system. The uncertainty about ac- 
tions of competitors as well as the potentially large number of competing engines make 
our optimisation problem difficult. We show that naive strategies (e.g., blindly indexing
lots of popular documents) are ineffective, because a rational search engine’s indexing 
decisions should depend on the (unknown) decisions of its opponents. 

Our main contributions are as follows: 

- We formalise the issues related to optimal behaviour in competitive heterogeneous 
search environments and propose a model for performance of a specialised search 
engine in such environments. 

- We provide a game-theoretic analysis of a simplified version of the problem and
motivate the use of the concept of “bounded rationality” [2]. Bounded rationality
assumes that decision makers act sub-optimally in the game-theoretic sense due to 
incomplete information about the environment and/or limited resources. 

- We propose a reinforcement learning procedure for topic selection, called 
COUGAR, which allows a specialised search engine to exploit sub-optimal
behaviour of its competitors to improve its own performance.

An evaluation of COUGAR in a simulation, driven by real user queries submitted 
to over 47 existing search engines, demonstrates the feasibility of our approach. 

2 Problem Formulation 

2.1 Search Engine Performance 

We adopt an economic view on search engine performance from the service provider’s
point of view. Performance is the difference between the value of the search service
provided (income) and the cost of the resources used to provide the service. The value of a
search service is a function of the user queries processed. The cost structure in an actual
search engine may be quite complicated, involving many categories, such as storage,
crawling, indexing, and searching. In our simplified version of the problem, we only
take into account the cost of resources involved in processing search queries. (Note that
we also obtained similar results for a more elaborate cost model that takes into account
the cost of document crawling, storage, and maintenance [3].) Under these assumptions,
we can use the following formula for search engine performance: $P = \alpha Q - \beta Q D$,
where $Q$ is the number of queries processed in a given time interval, $D$ is the number
of documents in the search engine index, and $\alpha$ and $\beta$ are constants.

$\alpha Q$ represents the service value: if the price of processing one search request for
a user is $\alpha$, then $\alpha Q$ would be the total income from service provisioning. $\beta Q D$
represents the cost of processing search requests. If $x$ amount of resources is sufficient
to process $Q$ queries, then we would need $2x$ to process twice as many queries in the
same time. Similarly, if $x$ resources is enough to search in $D$ documents for each query,
then we would need $2x$ to search twice as many documents in the same time. Thus,
the amount of resources (and, hence, the cost) is proportional to both $Q$ and $D$, and
so can be expressed as $\beta Q D$, where $\beta$ is a constant reflecting the resource costs. An
examination of the architecture of the FAST search engine (www.alltheweb.com)
shows that our cost function is not that far from reality [4].

We assume that all search engines in our system use the same $\alpha$ and $\beta$ constants
when calculating their performance. Having the same $\beta$ reasonably assumes that the
cost of resources (per “unit”) is the same for all search engines. Having the same $\alpha$
assumes, perhaps unrealistically, that the search engines choose to charge users the same 
amount per query. We leave to future work, however, optimisation of search engine 
performance in environments where engines may have different service pricing. With 
no service price differentiation, selection of search engines by the metasearcher depends 
on what documents the engines index. Therefore, the goal of each search engine would 
be to select the index content in a way that maximises its performance. 

2.2 Metasearch Model 

We assume a very generic model of how any reasonable metasearch system should be- 
have. This will allow us to abstract from implementation details of particular metasearch 
algorithms (presuming that they approximate our generic model). It is reasonable to as- 
sume that users would like to send queries to the search engine(s) that contain the most 
relevant documents to the query, and the more of them, the better. 

The ultimate goal of the metasearcher is to select for each user query to which search 
engines it should be forwarded to maximise the results relevance, while minimising the 
number of engines involved. The existing research in metasearch (e.g. [1]), however, 
does not go much further than simply ranking search engines. Since it is unclear how 
many top ranked search engines should be queried (and how many results requested), 
we assume that the query is always forwarded to the highest ranked search engine. In 
case several search engines have the same top rank, one is selected at random. 

The ranking of search engines is performed based on the expected number of
relevant documents that are indexed by each engine. Engine $i$ that indexes the largest
expected number of documents $NR_i^q$ relevant to query $q$ will have the highest rank.

We apply a probabilistic information retrieval approach to assessing relevance of
documents [5]. For each document $d$, there is a probability $\Pr(rel|q, d)$ that this
document will be considered by the user as relevant to query $q$. In this case,
$NR_i^q = \sum_{d \in i} \Pr(rel|q, d)$, where by $d \in i$ we mean the set of documents indexed by engine
$i$. Obviously, the metasearcher does not know the exact content of search engines, so it
tries to estimate $NR_i^q$ from the corresponding content summaries.
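
The ranking model can be made concrete with a small sketch. It assumes, purely for illustration, that the relevance probabilities and the index contents are known exactly; the function names and the toy data are hypothetical.

```python
import random

def expected_relevant(index, relevance):
    """NR_i^q: sum of Pr(rel | q, d) over the documents d indexed by engine i."""
    return sum(relevance.get(d, 0.0) for d in index)

def select_engine(indexes, relevance, rng=random):
    """Forward the query to the highest-ranked engine; break ties at random."""
    scores = {name: expected_relevant(docs, relevance) for name, docs in indexes.items()}
    best = max(scores.values())
    top = sorted(name for name, score in scores.items() if score == best)
    return rng.choice(top)

# Toy data: Pr(rel | q, d) for one query and the documents held by two engines.
relevance = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
indexes = {"A": {"d1", "d2"}, "B": {"d2", "d3"}}

print(select_engine(indexes, relevance))  # engine A has the larger expected count
```

A real metasearcher would replace `relevance` and `indexes` with estimates derived from content summaries, and would need a tolerance when comparing floating-point scores for ties.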

If $\Pr(rel|q_1, d) = \Pr(rel|q_2, d), \forall d$, then queries $q_1$ and $q_2$ will look the same from
both the metasearcher’s and the search engine’s points of view, even though the queries may
differ lexically. All engines will have the same rankings for $q_1$ and $q_2$, and the queries
will get forwarded to the same search engine. Therefore, all queries can be partitioned
into equivalence classes with identical $\Pr(rel|q, d)$ functions. We call such classes
topics. We assume in this paper that there is a fixed finite set of topics and queries can
be assigned to topics. Of course, this is not feasible in reality. One way to approximate
topics in practice would be to cluster user queries received in the past and then assign
new queries to the nearest clusters.

2.3 Engine Selection for “Ideal” Crawlers 

Let us assume that users only issue queries on a single topic. We will see later how 
this can be extended to multiple topics. It follows from Section 2.2 that to receive
queries, a search engine needs to be the highest ranked one for this topic. It means that 
given an index size D, a search engine would like to have a document index with the 
largest possible NRi. This can be achieved, if the engine indexes the D most relevant 
documents on the topic. 

Population of search engines is performed by topic-specific (focused) Web crawlers
[6]. Since it is very difficult to model a Web crawler, we assume that all search engines
have “ideal” Web crawlers which for a given $D$ can find the $D$ most relevant documents
on a given topic. Under this assumption, two search engines indexing the same number
of documents $D_1 = D_2$ will have $NR_1 = NR_2$. Similarly, if $D_1 < D_2$, then $NR_1 < NR_2$
(assuming that all documents have $\Pr(rel|d) > 0$). Therefore, the metasearcher
will forward user queries to the engine(s) containing the largest number of documents.

This model can be extended to multiple topics, if we assume that each document 
can only be relevant to a single topic. In this case, the state of a search engine can be 
represented by the number of documents $D_i^t$ that engine $i$ indexes for each topic $t$. A
query on topic $t$ will be forwarded to the engine $i$ with the largest $D_i^t$.

2.4 Decision Making Process 

The decision making process proceeds in series of fixed-length time intervals. For each 
time interval, search engines simultaneously and independently decide on how many 
documents to index on each topic. They also allocate the appropriate resources accord- 
ing to their expectations for the number of queries that users will submit during the 
interval. Since search engines cannot have unlimited crawling resources, we presume 
that they can only do incremental adjustments to their index contents that require the 
same time for all engines. The user queries submitted during the time interval are
allocated to the search engines based on their index parameters ($D_i^t$) as described above.
The whole process repeats in the next time interval. 

Let $\hat{Q}_i^t$ be the number of queries on topic $t$ that, according to expectations of search
engine $i$, the users will submit. Then the total number of queries expected by engine
$i$ can be calculated as $\hat{Q}_i = \sum_{t: D_i^t > 0} \hat{Q}_i^t$. We assume that engines always allocate
resources for the full amount of queries expected. Then the cost of resources allocated
by engine $i$ can be expressed as $\beta \hat{Q}_i D_i$, where $D_i = \sum_t D_i^t$ is the total number of
documents indexed by engine $i$. The number of queries on topic $t$ actually forwarded
to engine $i$ can be represented as

$$Q_i^t = \begin{cases} 0 & : \exists j,\ D_i^t < D_j^t \\ \dfrac{Q^t}{|B|} & : i \in B,\ B = \{b : D_b^t = \max_j D_j^t\} \end{cases}$$

where $B$ is the set of the highest-ranked search engines for topic $t$, and $Q^t$ is the number
of queries on topic $t$ actually submitted by the users. That is, the search engine does not
receive any queries if it is ranked lower than competitors, and receives its appropriate
share when it is the top ranked engine (see Sections 2.2 and 2.3). The total number of
queries forwarded to search engine $i$ can be calculated as $Q_i = \sum_{t: D_i^t > 0} Q_i^t$.

We assume that if the search engine receives more queries than it expected (i.e.
more queries than it can process), the excess queries are simply rejected. Therefore, the
total number of queries processed by search engine $i$ equals $\min(Q_i, \hat{Q}_i)$. Finally,
the performance of engine $i$ over a given time interval can be represented as follows:
$P_i = \alpha \min(Q_i, \hat{Q}_i) - \beta \hat{Q}_i D_i$.
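
The query allocation and performance formulas of this section can be put together in a short sketch. The dictionary layout, the function names and the numbers are assumptions chosen for illustration; they do not come from the simulator used in the paper.

```python
def forwarded_queries(index_sizes, submitted):
    """Total queries forwarded to each engine.

    index_sizes[i][t] is the number of documents engine i indexes on topic t;
    submitted[t] is the number of queries users actually submit on topic t.
    """
    totals = {engine: 0.0 for engine in index_sizes}
    for topic, queries in submitted.items():
        best = max(sizes.get(topic, 0) for sizes in index_sizes.values())
        if best == 0:
            continue  # no engine indexes this topic, so nobody receives its queries
        top = [e for e, sizes in index_sizes.items() if sizes.get(topic, 0) == best]
        for engine in top:
            totals[engine] += queries / len(top)  # equal share among top-ranked engines
    return totals

def performance(alpha, beta, received, expected, total_docs):
    """Income from processed queries minus the cost of the resources allocated."""
    return alpha * min(received, expected) - beta * expected * total_docs

sizes = {"e1": {"sport": 10, "news": 0}, "e2": {"sport": 5, "news": 5}}
submitted = {"sport": 100, "news": 40}
received = forwarded_queries(sizes, submitted)

# Engine e1 expected 100 queries and indexes 10 documents in total.
print(performance(1.0, 0.01, received["e1"], 100, 10))
```

Here engine e1 wins the "sport" topic and e2 wins "news", so each pays for the resources it allocated but only earns income on the topic where it is ranked first.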

3 The COUGAR Approach 

The decision-making process for individual search engines can be modelled as a
multi-stage game [7]. At each stage, a matrix game is played, where players are search
engines, actions are values of ($D_i^t$), and player $i$ receives payoff $P_i$.

If player $i$ knew the actions of its opponents and user queries at a future stage $k$, it
could calculate the optimal response as the one maximising $P_i(k)$. For example, in the case
of a single topic it should play $D_i(k) = \max_{j \neq i} D_j(k) + 1$ if $\max_{j \neq i} D_j(k) + 1 < \alpha/\beta$,
and $D_i(k) = 0$ otherwise (simply put, outperform opponents by 1 document if
profitable, and do not incur any costs otherwise).
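
For the single-topic case, this best response is a one-line computation. The sketch below assumes the opponents' index sizes are known, which is exactly the knowledge that is unavailable in practice; the function name is hypothetical.

```python
def best_response(opponent_sizes, alpha, beta):
    """Index size that maximises payoff against known opponents on a single topic."""
    target = max(opponent_sizes, default=0) + 1
    # Each processed query earns alpha - beta * D, which is positive only if D < alpha / beta.
    return target if target < alpha / beta else 0

print(best_response([3, 7], alpha=1.0, beta=0.01))   # outbid the largest opponent by one document
print(best_response([120], alpha=1.0, beta=0.01))    # winning would cost more than it earns
```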

In reality, players do not know the future. Uncertainty about future queries can 
be largely resolved by reasonably assuming that user interests usually do not change 
quickly. That is, queries in the next interval are likely to be approximately the same as 
queries in the previous one. A more difficult problem is not knowing future actions of 
the opponents (competing search engines). One possible way around this would be to 
agree on (supposedly, mutually beneficial) future actions in advance. To avoid decep- 
tion, players would have to agree on playing a Nash equilibrium [7] of the game, since 
then there will be no incentive for them to not follow the agreement. Agreeing to play 
a Nash equilibrium, however, becomes problematic when the game has multiple such 
equilibria. Players would be willing to agree on a Nash equilibrium yielding to them the 
highest (expected) payoffs, but the task of characterising all Nash equilibria of a game 
is NP-hard even given complete information about the game (as follows from [8]). 

NP-hardness results and the possibility that players may not have complete
information about the game and/or their opponents lead us to the idea of “bounded
rationality” [2]. Bounded rationality assumes that players may not use the optimal strategies
in the game-theoretic sense. Our proposal is to cast the problem of optimal behaviour
in the game as a learning task, where the player would have to learn a strategy that 
performs well against its sub-optimal opponents. 

Learning in games has been studied extensively in both game theory and machine
learning. Some examples include fictitious play and opponent modelling. Fictitious play
assumes that the other players are following some Markovian (possibly mixed)
strategies, which are estimated from their historical play [9]. Opponent modelling assumes
that opponent strategies are representable by finite state automata. The player learns
parameters of the opponent’s model from experience and then calculates the
best-response automaton [10]. We apply a more recent technique from reinforcement
learning called GAPS (which stands for Gradient Ascent for Policy Search) [11]. In GAPS,
the learner plays a parameterised strategy represented, e.g., by a finite state automaton, 
where parameters are probabilities of actions and state transitions. GAPS implements 
stochastic gradient ascent in the space of policy parameters. After each learning trial, 
parameters of the policy are updated by following the payoff gradient. 
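
As an illustration of stochastic gradient ascent in the space of policy parameters, the following minimal sketch uses a stateless softmax policy over the three index actions and a REINFORCE-style update. It is not the GAPS update of [11], which also adapts the state transition probabilities of the automaton; the function names, the toy payoff, the learning rate, and the number of trials are assumptions made for illustration only.

```python
import math
import random

ACTIONS = ["Grow", "Same", "Shrink"]

def softmax(theta):
    """Turn unconstrained parameters into action probabilities."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_step(theta, action, payoff, rate):
    """theta += rate * payoff * grad log pi(action); for a softmax policy
    the gradient component i is (1 if i == action else 0) - pi_i."""
    probs = softmax(theta)
    return [
        th + rate * payoff * ((1.0 if i == action else 0.0) - p)
        for i, (th, p) in enumerate(zip(theta, probs))
    ]

def train(payoff_of, trials=2000, rate=0.1, seed=0):
    """Run learning trials: sample an action, observe a payoff, update."""
    rng = random.Random(seed)
    theta = [0.0] * len(ACTIONS)
    for _ in range(trials):
        probs = softmax(theta)
        action = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        theta = gradient_step(theta, action, payoff_of(ACTIONS[action]), rate)
    return softmax(theta)
```

With a toy payoff that rewards only Grow, the learned policy concentrates its probability mass on that action; as with any gradient method, only a local optimum is guaranteed in general.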

GAPS has a number of advantages important for our domain. It works in partially 
observable games (e.g. it does not require complete knowledge of the opponents’ ac- 
tions). It also scales well to multiple topics by modelling decision-making as a game 
with factored actions (where action components correspond to topics). The action space 
in such games is the product of factor spaces for each action component. GAPS, how- 
ever, allows us to reduce the learning complexity: rather than learning in the product 
action space, separate GAPS learners can be used for each action component. It has 
been shown that such distributed learning is equivalent to learning in the product ac- 
tion space. As with all gradient-based methods, the disadvantage of GAPS is that it is 
only guaranteed to find a local optimum. We call a search engine that uses the proposed 
approach COUGAR, which stands for Competitor Using GAPS Against Rivals. 

3.1 Engine Controller Design 

The task of the search engine controller is to change the state of the document index to 
maximise the engine performance. When making decisions, the engine controller can 
receive information about current characteristics of its own search engine as well as the 
external environment in the form of observations. 

The COUGAR controllers are modelled by non-deterministic Moore automata. A 
controller consists of a set of Moore automata ($M^t$), one for each topic t, functioning 
synchronously. Each automaton is responsible for controlling the state of the search index 
for the corresponding topic. The following actions are available to each automaton 
$M^t$ in the controller: Grow - increase the number of documents indexed on topic t by 
one; Same - do not change the number of documents on topic t; Shrink - decrease the 
number of documents on topic t by one. The resulting action of the controller is the 
product of actions (one for each topic) produced by each of the individual automata. 

A controller’s observations consist of two parts: observations of the state of its own 
search engine and observations of the opponents’ state. The observations of its own 
state reflect the number of documents in the search engine’s index for each topic. The 
observations of the opponents’ state reflect the relative position of the opponents in the 
metasearcher rankings, which indirectly gives the controller partial information about 
the state of the opponents’ index. The following three observations of the opponents’ 
state are available for each topic t: Winning - there are opponents ranked higher for 
topic t than our search engine; Tying - opponents have either the same or a smaller rank 
for topic t than our search engine; Losing - the rank of our search engine for topic t is 
higher than that of its opponents. 

For T topics, the controller’s inputs consist of T observations of the state of its 
own search engine (one for each topic) and T observations of the relative positions of 
the opponents (one per topic). Note that the state of all opponents is summarised as a 
vector of T observations. Each of the Moore automata $M^t$ in the COUGAR controller 
receives observations only for the corresponding topic t. 

One may ask how the controller can obtain information about rankings of its op- 
ponents for a given topic. This can be done by sending a query on the topic of interest 
to the metasearcher and requesting a ranked list of search engines for the query. We 
also assume that the controller can obtain from the metasearcher information (statistics) 
on the queries previously submitted by users. These data are used to calculate 
the expected number of queries for each topic, $Q_t^i$. In particular, for our experiments 
the number of queries on topic t expected by engine i in a given time interval k equals 
the number of queries on topic t submitted by users in the previous interval (i.e. 
$Q_t^i(k) = Q_t(k - 1)$). 

3.2 Learning Procedure 

Training of the COUGAR controller to compete against various opponents is performed 
in a series of simulation trials. Each simulation trial consists of 100 days, where each day 
corresponds to one stage of the multi-stage game played. The search engines start with 
empty indices and then, driven by their controllers, adjust their index contents. At the 
beginning of each day, search engine controllers receive observations and simultaneously 
produce control actions (change their document indices). A query generator issues a 
stream of search queries for one day. The metasearcher distributes these queries between 
the search engines according to their index parameters on the day. The resulting 
reward in a simulation trial is calculated in the way traditional for reinforcement 
learning, as a sum of discounted rewards from each day. After each trial, a learning step is 
performed. The COUGAR controller updates its strategy using the GAPS algorithm. 
That is, the action and state transition probabilities of the controller’s Moore automata 
are modified using the payoff gradient (due to the lack of space see [11] for details of 
the update mechanism). 
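
The trial reward described above, a sum of discounted daily rewards, can be sketched as follows. The discount factor `gamma` and the function name are assumptions; the value of the discount factor used in the experiments is not stated here.

```python
def discounted_return(daily_rewards, gamma):
    """Sum of gamma**day * reward over the days of one trial (day 0 first).

    gamma is a hypothetical discount factor in (0, 1]."""
    total = 0.0
    for day, reward in enumerate(daily_rewards):
        total += (gamma ** day) * reward
    return total
```

For a 100-day trial, `daily_rewards` would hold the 100 per-day performance values of one engine.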

In our experiments, we simulated two competing search engines for a single and 
multiple topics. One search engine was using a fixed strategy, the other one was using 
the COUGAR controller. To simulate user search queries, we used HTTP logs obtained 
from a Web proxy of a large ISP. Since each search engine uses a different URL syntax 
for submission of requests, we developed URL extraction rules individually for 47 well-known 
search engines. The total number of queries extracted was 657,861, collected over 
a period of 190 days. We associated topics with search terms in the logs. To simulate 
queries for n topics, we extracted the n most popular terms from the logs. The number 
of queries generated on topic t during a given time interval was equal to the number of 
queries with term t in the logs belonging to this time interval. 
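
The topic extraction and per-interval query counts described above can be sketched as follows. The log format (a list of pairs of an interval index and the list of terms in one query) and all function names are assumptions for illustration; the actual log-processing code is not described here. The last function applies the previous-interval estimate of Section 3.1.

```python
from collections import Counter, defaultdict

def top_terms(log, n):
    """Return the n most frequent search terms; each becomes a topic."""
    counts = Counter(term for _, terms in log for term in terms)
    return [term for term, _ in counts.most_common(n)]

def queries_per_interval(log, topics):
    """Count, for every interval, the queries containing each topic term."""
    table = defaultdict(Counter)
    for interval, terms in log:
        for term in set(terms):          # count a query once per term
            if term in topics:
                table[interval][term] += 1
    return table

def expected_queries(table, topic, k):
    """Expected queries in interval k = observed queries in interval k - 1."""
    return table[k - 1][topic]
```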
Fig. 1. “Bubble” vs COUGAR, single topic. Left: learning curve. Right: sample trial. 



4 Results 

4.1 “Bubble” Strategy 

The “Bubble” strategy tries to index as many documents as possible without any re- 
gard to what competitors are doing. As follows from our performance formula (see 
Section 2.4), such unconstrained growing leads eventually to negative performance. 
Once the total reward falls below a certain threshold, the “Bubble” search engine goes 
bankrupt (i.e. it shrinks its index to 0 documents and retires until the end of the trial). 
This process imitates the situation in which a search service provider expands its busi- 
ness without paying attention to costs, eventually runs out of money, and quits. An 
intuitively sensible response to the “Bubble” strategy would be to wait until the bubble 
“bursts” and then come into the game alone. That is, a competitor should not index 
anything while the “Bubble” grows and should start indexing a minimal number of 
documents once the “Bubble” search engine goes bankrupt. 

Figure 1 (left) shows how the performance of the COUGAR engine improved during 
learning for a single topic case. Once COUGAR reached a steady performance level, its 
resulting strategy was evaluated in a series of testing trials. Figure 1 (right) visualises a 
sample trial between the “Bubble” and the COUGAR engines by showing the number 
of documents indexed by the engines on each day of the trial. 

In the case of multiple topics, the “Bubble” was increasing (and decreasing) the number 
of documents indexed for each topic simultaneously. The COUGAR controller was 
using separate GAPS learners to manage the index size for each topic (as discussed in 
Section 3). Figure 2 shows the engines’ behaviour (left) and performance (right) in a test 
trial with two different topics. Note that COUGAR has learned to wait until “Bubble” 
goes bankrupt, and then to win all queries for both topics. 



4.2 “Wimp” Strategy 

The “Wimp” controller used a more intelligent strategy. Consider it first for the case 
of a single topic. The set of all possible document index sizes is divided by “Wimp” 
into three non-overlapping sequential regions: “Confident”, “Unsure”, and “Panic”. The 
Fig. 2. “Bubble” vs COUGAR, multiple topics. Left: sample trial; the top half of Y axis shows 
the number of documents for topic 1, while the bottom half shows the number of documents for 
topic 2. Right: performance in a sample trial. 

Fig. 3. “Wimp” vs COUGAR, single topic. Left: learning curve. Right: sample trial. 



“Wimp’s” behaviour in each region is as follows: Confident - the strategy in this region 
is to increase the document index size until it ranks higher than the opponent. Once this 
goal is achieved, the “Wimp” stops growing and keeps the index unchanged; Unsure 
- in this region, the “Wimp” keeps the index unchanged if it is ranked higher than or the 
same as the opponent. Otherwise, it retires (i.e. reduces the index size to 0); Panic - the 
“Wimp” retires straight away. 
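
The three-region rule above can be sketched as a small decision function for the single-topic case. The region boundaries `unsure_from` and `panic_from`, the growth step of one document, and the encoding of the relative rank are assumptions; the actual thresholds used in the experiments are not given here.

```python
def wimp_next_index_size(index_size, rank_vs_opponent, unsure_from, panic_from):
    """Return the next index size of a single-topic "Wimp".

    rank_vs_opponent is "higher", "same" or "lower" (the Wimp's rank
    relative to its opponent). Sizes below unsure_from are "Confident",
    sizes from unsure_from up to panic_from - 1 are "Unsure", and sizes
    from panic_from upwards are "Panic" (hypothetical boundaries)."""
    if index_size >= panic_from:            # Panic: retire straight away
        return 0
    if index_size >= unsure_from:           # Unsure: hold if not losing
        if rank_vs_opponent in ("higher", "same"):
            return index_size
        return 0
    if rank_vs_opponent == "higher":        # Confident and already ahead
        return index_size
    return index_size + 1                   # Confident: keep growing
```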

The overall idea is that the “Wimp” tries to outperform its opponent while in the 
“Confident” region by growing the index. When the index grows into the “Unsure” 
region, the “Wimp” prefers retirement to competition, unless it is already winning over 
or tying with the opponent. This reflects the fact that the potential losses in the “Unsure” 
region (if the opponent wins) become substantial, so the “Wimp” does not dare to take the risk. 

Common sense tells us that one should behave aggressively against the “Wimp” 
in the beginning, to knock it out of competition, and then enjoy the benefits of 
monopoly. This is exactly what the COUGAR controller has learned to do, as can be 
seen from Figure 3. 

To generalise the “Wimp” strategy to multiple topics, it was modified in the fol- 
lowing way. The “Wimp” opponent did not differentiate between topics of both queries 
Fig. 4. “Wimp” vs COUGAR, multiple topics. Left: learning curve. Right: sample trial; the top 
half of Y axis shows the number of documents for topic 1, while the bottom half shows the 
number of documents for topic 2. 



and documents. When assessing its own index size, the “Wimp” was simply adding the 
documents for different topics together. Similarly, when observing relative positions of 
the opponent, it was adding together ranking scores for different topics. Finally, like the 
multi-topic “Bubble”, the “Wimp” was changing its index size synchronously for each 
topic. Figure 4 presents the learning curve (left) and a sample trial (right) respectively. 
COUGAR decided to specialise on the more popular topic, where it outperformed the 
opponent. The “Wimp” mistakenly assumed that it was winning in the competition, 
since its rank for both topics together was higher. In reality, it was receiving only queries 
for one topic, which did not cover its expenses for indexing documents on both topics. 

4.3 Self-play 

In the final set of experiments, we analysed the behaviour of the COUGAR controller competing 
against itself. It is not guaranteed from the theoretical point of view that the 
gradient-based learning will always converge in self play. In practice, however, we ob- 
served that both learners converged to relatively stable strategies. We used the same 
setup with two different topics in the system. Figure 5 (left) shows that the players de- 
cided to split the query market: each of the search engines specialised on a different 
topic. Figure 5 (right) also shows the learning curves. 



5 Related Work 

The issues of performance (or profit) maximising behaviour in environments with mul- 
tiple, possibly competing, decision makers have been addressed in a number of contexts, 
including multi-agent e-commerce systems and distributed databases. 

In Mariposa [12], the distributed system consists of a federation of databases and 
query brokers. A user submits a query to a broker for execution together with the amount 
of money she is willing to pay for it. The broker partitions the query into sub-queries and 
finds a set of databases that can execute the sub-queries with the total cost not exceeding 






[Figure 5 plots omitted. Left panel: x-axis Days (0 to 100). Right panel: x-axis Trials (0 to 100000). Legend: COUGAR 1 (topic 1), COUGAR 1 (topic 2), COUGAR 2 (topic 1), COUGAR 2 (topic 2).] 

Fig. 5. COUGAR in self play, multiple topics. Left: sample trial; the top half of Y axis shows the 
number of documents for topic 1, while the bottom half shows the number of documents for topic 
2. Right: learning curve. 



what the user paid and the minimal processing delay. A database can execute a sub- 
query only if it has all necessary data (data fragments) that are involved. The databases 
can trade data fragments (i.e. purchase or sell them) to maximise their revenues. 

Trading data fragments may seem similar to the topic selection problem for spe- 
cialised search engines. There are, however, significant differences between them. Ac- 
quiring a data fragment is an act of mutual agreement between the seller and the buyer, 
while search engines may change their index contents independently from others. Also, 
proprietorship considerations are not taken into account. 

Greenwald et al. have studied behaviour dynamics of pricebots, automated agents 
that act on behalf of service suppliers and employ price-setting algorithms to maximise 
profits [13]. In the proposed model, the sellers offer a homogeneous good in an economy 
with multiple sellers and buyers. The buyers may have different strategies for selecting 
the seller, ranging from random to the selection of the cheapest seller on the market 
(bargain hunters), while the sellers use the same pricing strategy. A similar model but 
with populations of sellers using different strategies has been studied in [14]. The pric- 
ing problem can be viewed as a very simple instance of our topic selection task (namely, 
as a single topic case with some modifications to the performance model). 

6 Conclusions and Future Work 

The successful deployment of heterogeneous Web search environments will require that 
participating search service providers have effective means for managing the perfor- 
mance of their search engines. We analysed how specialised search engines can select 
on which topic(s) to specialise and how many documents to index on that topic to max- 
imise their performance. We provided both an in-depth theoretical analysis of the prob- 
lem and a practical method for automatically managing the search engine content in a 
simplified version of the problem. Our adaptive search engine, COUGAR, has managed 
to compete with some non-trivial opponents as shown by the experimental results. Most 
importantly, the same learning mechanism worked successfully against opponents us- 
ing different strategies. Even when competing against other adaptive search engines (in 
our case itself), COUGAR has demonstrated fairly sensible behaviour from the economic 
point of view. Namely, the engines have learned to segment the search services 
market with each engine occupying its niche, instead of a head-on competition. 

We do not claim to provide a complete solution for the problem here, but we believe 
it is a promising first step. Clearly, we have made many strong assumptions 
in our models. One future direction will be to relax these assumptions to make our 
simulations more realistic. In particular, we intend to perform experiments with real 
documents and using some existing metasearch algorithms. This should allow us to 
avoid the assumption of the “single-topic” documents and also to assess how closely 
our metasearch model reflects real-life engine selection algorithms. We also plan to use 
clustering of user queries to derive topics in our simulations. 

While we are motivated by the optimal behaviour for search services over document 
collections, our approach is applicable in more general scenarios involving services that 
must weigh the cost of their inventory of objects against the expected inventories of 
their competitors and the anticipated needs of their customers. For example, it would be 
interesting to apply our ideas to an environment in which large retail e-commerce sites 
must decide which products to stock. Another important direction would be to further 
investigate performance and convergence properties of the learning algorithm when 
opponents also evolve over time (e.g. against other learners). One possible approach 
here would be to use a variable learning rate as suggested in [15]. 
Abstract. A central issue in logical concept induction is the prospect of 
inconsistency. This problem may arise due to noise in the training data, 
or because the target concept does not fit the underlying concept class. In 
this paper, we introduce the paradigm of inductive belief merging which 
handles this issue within a uniform framework. The key idea is to base 
learning on a belief merging operator that selects the concepts which 
are as close as possible to the set of training examples. From a computational 
perspective, we apply this paradigm to robust k-DNF learning. To 
this end, we develop a greedy algorithm which approximates the optimal 
concepts to within a logarithmic factor. The time complexity of the algorithm 
is polynomial in the size of k. Moreover, the method is bidirectional 
and returns one maximally specific concept and one maximally general 
concept. We present experimental results showing the effectiveness of our 
algorithm on both nominal and numerical datasets. 



1 Introduction 

The problem of logical concept induction has occupied a central position in 
machine learning [1,2]. Informally, a concept is a formula defined over some 
knowledge representation language called the concept class, and an example is a 
description of an instance together with a label, positive if the instance belongs 
to the unknown target concept and negative otherwise. The problem is to extrap- 
olate or induce from a collection of examples called the training set, a concept 
in the concept class that accurately classifies future, unlabelled instances. 

A useful paradigm for studying this issue is the notion of version space intro- 
duced by Mitchell in [1]. Given some concept class, the version space for a train- 
ing set is simply the set of concepts in the concept class that are consistent with 
the data. Probably, the most salient feature of this paradigm lies in the property 
of bidirectional learning [3]. Namely, for admissible concept classes like k-DNF, 
k-CNF or Horn theories, every concept in a version space can be factorized from 
below by a maximally specific concept and from above by a maximally general 
concept. Thus, a version space incorporates two dual strategies for learning a 
target concept, one from a specific viewpoint (allowing errors of omission) and 
the other from a general viewpoint (allowing errors of commission). This bidi- 
rectional approach is particularly useful when the available data is not sufficient 



N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 229-240, 2003. 
© Springer-Verlag Berlin Heidelberg 2003 






Frederic Koriche and Joel Quinqueton 



to converge to the unique identity of the target concept. From this perspective, 
Mitchell proposed to generate the set S of all maximally specific concepts and 
the set G of all maximally general concepts, using the so-called Candidate Elim- 
ination algorithm [1]. Since these sets are often expensive in space [4], Sablon 
and his colleagues [5] proposed to maintain only one maximally specific and one 
maximally general concept. Although their learning algorithm does not pretend 
to capture the whole solution set, it guarantees a linear space-complexity. 

Unfortunately, the version-space paradigm has proven fundamentally limited 
in practice due to its inability to handle inconsistency. A set of examples is 
said to be inconsistent with respect to a given concept class if no concept in the 
class is able to distinguish the positive from the negative examples. In the presence 
of inconsistency, any version space becomes empty (it is said to collapse) and 
hence the learning algorithm degenerates into triviality. In fact, as noticed by 
Clark and Niblett [6], very few real-world problems operate under consistent conditions. 
Inconsistency may arise due to the imperfection of the “training set”. 
For example, some observations may contain noise due to imperfect measuring 
equipment, or the available data may be collected from several, not necessarily 
agreeing, sources. Inconsistency may also occur due to the incompleteness of the 
“concept class”. In practice, the target concept class is not known in advance, 
so the learner may use a hypothesis language which is inappropriate for the target 
concept. Nonetheless, even inconsistent environments may contain a great 
deal of valid information. Therefore, it seems important to develop alternative 
paradigms for robust learners that can learn as much as possible 
given the training data and the concept class available. 

Several authors have attempted to handle this issue by generalizing the standard 
paradigm of version spaces. Notably, Hirsh and Cohen [7,8] consider inconsistency 
as a problem of reasoning about uncertainty. Informally, each example 
which is assumed to be corrupted gives rise to a set of supposed instances. 
The learner computes all version spaces consistent with at least one supposed 
instance originating from any observed example and then returns their intersection. 
As mentioned by the authors, this approach raises the question of how 
sets of supposed instances are acquired in practice. Moreover, consistency is not 
guaranteed to be recovered: if the sets of supposed instances are chosen inap- 
propriately then the resulting version space may collapse, as in the standard 
case. Last, this scheme is basically limited because the number of version spaces 
maintained in parallel during the learning phase can grow exponentially. 

In another line of research, Sebag [9,10] develops a model of disjunctive ver- 
sion spaces which deals with inconsistency by using a voting mechanism. A 
separate classifier is learned for each positive training example taken with the 
set of all negative examples, then new instances are classified by combining the 
votes of these different hypotheses. The complexity of induction is shown to be 
polynomial in the number of instances. However, an important drawback of 
the approach is the poor comprehensibility of the resulting concept (typically a 
disjunction of conjunctions of disjunctions). Moreover, we lose the bidirectional 
property of version spaces since only maximally general concepts are learned. 

In this study, we adopt a radically different approach inspired by belief 
merging, a research field that has received increasing attention in the database 
and the knowledge representation communities [11,12,13]. The aim of belief 
merging is to infer from a set of theories, expressed in some logical formalism, a 
new theory considered as the overall knowledge of the different sources. When 
the initial theories are consistent together, the result is simply the intersection of 
their models. However, in the presence of inconsistency, a nontrivial operator must 
be elaborated. The key idea of the so-called “distance-based” merging operators 
is to select those models that are as close as possible to the initial theories, using 
an appropriate metric in the space of all possible interpretations. 

The main insight underlying our study is to base learning on a merging 
operator that selects the concepts which are as close as possible to the set of 
training examples. In the present paper, we apply this idea to robust k-DNF 
learning. As argued by Valiant [14,15], the DNF family is a natural class for 
expressing and understanding real concepts in propositional learning. 

In section 2, we present the paradigm of inductive belief merging. In this setting, 
we define a distance-based merging operator that introduces a preference 
bias in the k-DNF class, induced by the sum of the distances $d_k(\varphi, e)$ between a 
concept $\varphi$ and each example $e$ in the training set. The resulting “robust version 
space” is the set of all concepts whose distance to the training set is minimal. 
In section 3, we show that every concept in this version space can be characterized 
by a corresponding “minimal weighted cover” defined from the training set 
and the concept class. This establishes a close relationship between the learning 
problem and the so-called weighted set cover problem [16,17]. Based on this 
correspondence, we develop in section 4 a greedy algorithm which builds a cover 
that approximates the optimum to within a logarithmic factor. The algorithm is 
bidirectional and returns the maximally specific k-DNF and the maximally general 
k-DNF generated from the approximate cover. The method is guaranteed 
to be polynomial in time and uses only linear space. 

From a conceptual point of view, a benefit of our paradigm is that it allows us 
to characterize robust learning in terms of three distinct biases, namely, 
the restriction bias imposed by the concept class, the preference bias defined by 
the merging operator, and the search bias given by the approximation algorithm. 
From an empirical point of view, we report in section 5 experiments on twenty 
datasets that show diversity in size, number of attributes and type of attributes. 
For almost all domains, we show that robust k-DNF learning is equal or superior 
to the popular C4.5 decision-tree learning algorithm [18,19]. 

2 Inductive Belief Merging 

In this section, we present the logical aspects of our framework. We begin by 
introducing some usual definitions in concept learning and then detail the 
notion of inductive belief merging. 

2.1 Preliminaries 

We consider a finite set $V$ of boolean variables. A literal is either a variable $v$ 
or its negation $\neg v$. A term is a conjunction of literals and a DNF formula is a 
disjunction of terms. In the following, we shall represent DNF as sets of terms 
and terms as sets of literals. Given a positive integer $k$, a k-term is a term that 
contains at most $k$ literals and a k-DNF concept is a DNF composed of k-terms. 
Given two k-DNF concepts $\varphi$ and $\psi$, we say that $\varphi$ is more specific than $\psi$ (or 
equivalently $\psi$ is more general than $\varphi$) if $\varphi$ is a subset of $\psi$. 

An instance is a map $x$ from $V$ to $\{0,1\}$. Given an instance $x$ and a formula 
$\varphi$, we say that $x$ is consistent with $\varphi$ if $x$ is a logical model of $\varphi$. Otherwise, 
we say that $x$ is inconsistent with $\varphi$. An example $e$ is a pair $(x_e, v_e)$ where $x_e$ 
is an instance and $v_e$ is a boolean variable. An example $e$ is called positive if 
$v_e = 1$ and negative if $v_e = 0$. Given an example $e$ and a formula $\varphi$, we say that 
$e$ is consistent (resp. inconsistent) with $\varphi$ if $x_e$ is consistent (resp. inconsistent) 
with $\varphi$. Given a positive integer $k$ and a positive (resp. negative) example $e$, 
the atomic version space of $e$ with respect to $k$, denoted $C_k(e)$, is the set of all 
k-DNF concepts that are consistent (resp. inconsistent) with $e$. Now, given a set of 
examples $E$, the version space of $E$ with respect to $k$, denoted $C_k(E)$, is the set 
of all k-DNF concepts that are consistent with every positive example in $E$ and 
that are inconsistent with every negative example in $E$. As observed by Hirsh in 
[8], the overall version space of $E$ is simply the intersection of the atomic version 
spaces defined for each example in $E$: 

$$C_k(E) = \bigcap_{e \in E} C_k(e).$$ 

A training set $E$ is called consistent with respect to the k-DNF class if 
$C_k(E)$ is not empty, and inconsistent otherwise. When $E$ is consistent, the aim 
of concept learning is then to find a concept $\varphi$ in $C_k(E)$. However, in case 
of inconsistency, $C_k(E)$ collapses and the problem degenerates into triviality. So, it is 
necessary to generalize the notion of version space in order to handle this issue. 
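
The definitions of this subsection translate directly into a membership test for the version space. The sketch below assumes a hypothetical encoding: a literal is a pair (variable, required value), a term is a frozenset of literals, a concept is a set of terms, and an example is a pair of an instance dictionary and a label.

```python
def term_holds(term, x):
    """A term (conjunction) holds on instance x if all its literals hold."""
    return all(x[var] == val for var, val in term)

def concept_holds(concept, x):
    """A DNF concept holds on x if at least one of its terms holds."""
    return any(term_holds(term, x) for term in concept)

def in_version_space(concept, examples):
    """True iff the concept is consistent with every positive example
    and inconsistent with every negative example."""
    return all(concept_holds(concept, x) == bool(label)
               for x, label in examples)
```

A training set is then inconsistent with respect to the class exactly when no candidate concept passes this test.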

2.2 Learning via Merging 

The key idea underlying our framework is to replace the “intersection” operator 
by a “merging” operator. To this end, we need some additional definitions. Given 
two DNF formulas $\varphi$ and $\psi$, the term distance between $\varphi$ and $\psi$, denoted $d(\varphi, \psi)$, 
is defined as the number of terms on which the two concepts differ: 

$$d(\varphi, \psi) = |(\varphi \cup \psi) - (\varphi \cap \psi)|.$$ 

This notion of distance can be seen as the number of elementary operations 
needed to transform the first concept into the second one. Now, given a k-DNF 
concept $\varphi$ and an example $e$, the distance between $\varphi$ and $e$ with respect to $k$, 
denoted $d_k(\varphi, e)$, is defined by the minimum distance between this concept and 
the atomic version space of $e$: 

$$d_k(\varphi, e) = \min\{d(\varphi, \psi) : \psi \in C_k(e)\}.$$ 
Intuitively, the distance between $\varphi$ and $e$ is the minimal number of k-terms 
that need to be added or deleted in order to correctly cover $e$. Specifically, if $e$ 
is positive, then the distance between $\varphi$ and $e$ is the minimal number of k-terms 
that need to be added to $\varphi$ in order to be consistent with $x_e$. From a dual point 
of view, if $e$ is negative, then the distance is the minimal number of k-terms that 
need to be deleted from $\varphi$ in order to be inconsistent with $x_e$. 

Finally, given a k-DNF concept $\varphi$ and a set of examples $E$, the distance 
between $\varphi$ and $E$ with respect to $k$, denoted $d_k(\varphi, E)$, is the sum of the distances 
between this concept and the examples that occur in $E$: 

$$d_k(\varphi, E) = \sum_{e \in E} d_k(\varphi, e).$$ 

Interestingly, we observe that this distance induces a preference ordering over 
the k-DNF class defined by the following condition: $\varphi$ is more preferred than $\psi$ 
for $E$ with respect to $k$ if $d_k(\varphi, E) \le d_k(\psi, E)$. It is easy to see that the preference 
relation is a total pre-order. Thus, we say that $\varphi$ is a most preferred concept for 
$E$ with respect to $k$ if $d_k(\varphi, E)$ is minimal, that is, for every k-DNF formula $\psi$ 
we have $d_k(\varphi, E) \le d_k(\psi, E)$. Now, we have all elements in hand to capture the 
solution set of “learning via merging”. Given a positive integer $k$ and a training 
set $E$, the inductive merging of $E$ with respect to $k$, denoted $\Delta_k(E)$, is the set 
of all most preferred concepts for $E$ in the k-DNF class: 

$$\Delta_k(E) = \{\varphi : \varphi \text{ is a k-DNF concept and } d_k(\varphi, E) \text{ is minimal}\}.$$ 

This model of robust induction embodies two important properties. First, 
it is guaranteed to never collapse. This is a direct consequence of the above 
definition. Second, inductive merging is a generalization of the standard notion 
of version space. Namely, if $E$ is consistent with respect to the k-DNF concept 
class, then $\Delta_k(E) = C_k(E)$. This follows from the fact that $d_k(\varphi, E) = 0$ iff $\varphi \in C_k(E)$. 

Example 1. Suppose that the training set $E$ is defined by the following examples:

$e_1 = (\{v_1, v_2\}, 1)$, $e_2 = (\{v_1, \neg v_2\}, 1)$, $e_3 = (\{\neg v_1, \neg v_2\}, 1)$, $e_4 = (\{v_1, \neg v_2\}, 0)$

and $e_5 = (\{\neg v_1, v_2\}, 0)$. Suppose further that the concept class is the set of
all 1-DNF (simple disjuncts). Clearly, the version space $C_1(E)$ would collapse
here. Now, consider the distances reported on the table below (we only examine
non trivial disjuncts). We observe that $\Delta_k(E)$ includes two maximally specific
concepts $\{v_1\}$ and $\{\neg v_2\}$, and one maximally general concept $\{v_1, \neg v_2\}$.
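
The distances behind Example 1 can be recomputed with a short sketch. It is illustrative only and rests on assumptions that are not spelled out in this part of the text: literals are encoded as strings such as `"v1"` and `"~v2"`, a term is taken to be consistent with an example when all of its literals hold in the example, and the fourth example (garbled in the source) is read as $(\{v_1, \neg v_2\}, 0)$. All function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Term = FrozenSet[str]                   # conjunction of literals, e.g. {"v1", "~v2"}
Example = Tuple[FrozenSet[str], int]    # (literals true in the example, label 0 or 1)


def consistent(term: Term, example: Example) -> bool:
    # Assumption: a term is consistent with an example when every literal
    # of the term holds in the example's assignment.
    return term <= example[0]


def distance_to_example(concept: List[Term], example: Example) -> int:
    matching = sum(1 for term in concept if consistent(term, example))
    if example[1] == 1:
        # Positive example: one term must be added unless some term already covers it.
        return 0 if matching > 0 else 1
    # Negative example: every term consistent with it must be deleted.
    return matching


def distance(concept: List[Term], examples: List[Example]) -> int:
    # d_k(phi, E): sum of the distances to the individual examples.
    return sum(distance_to_example(concept, e) for e in examples)


def make_example(literals: str, label: int) -> Example:
    return (frozenset(literals.split()), label)


def make_concept(*terms: str) -> List[Term]:
    return [frozenset(t.split()) for t in terms]


# Training set of Example 1 (the reading of e4 is an assumption, see above).
EXAMPLES = [
    make_example("v1 v2", 1),
    make_example("v1 ~v2", 1),
    make_example("~v1 ~v2", 1),
    make_example("v1 ~v2", 0),
    make_example("~v1 v2", 0),
]

if __name__ == "__main__":
    for name in ["v1", "~v2", "~v1", "v2"]:
        print(name, distance(make_concept(name), EXAMPLES))
    print("v1 or ~v2", distance(make_concept("v1", "~v2"), EXAMPLES))
```

Under these assumptions the three concepts named in the example all sit at distance 2, while the disjuncts $\{\neg v_1\}$ and $\{v_2\}$ are at distance 3.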
3 A Representation Theorem

After an excursion into the logical aspects of the framework, we now provide a
representation theorem that makes it possible to characterize solutions in $\Delta_k(E)$ in terms
of minimal weighted covers. As we shall see in the next section, this representation
is particularly useful for constructing efficient approximation algorithms.

To this end, we need some additional definitions. Given a set of examples $E$
and a $k$-term $t$, the extension of $t$ in $E$, denoted $E(t)$, is the set of examples in $E$
that are consistent with $t$. The weight of $t$ in $E$, denoted $w(t, E)$, is the size of the
extension of $t$ in $E$. Given a set of examples $E$, a cover of $E$ is a list of $k$-terms
$\pi = (t_1, \ldots, t_n)$ such that every positive example $e$ in $E$ is consistent with at
least one term $t_i$ in $\pi$. Intuitively, the index $i$ denotes the priority of the term
$t_i$ in the cover $\pi$, with the underlying assumption that 1 is the highest priority
and $n$ is the lowest priority. Given a cover $\pi$ of $E$ and a $k$-term $t$, the extension
of $t$ in $E$ with respect to $\pi$ is inductively defined by the following conditions:

$$E(t, \pi) = \begin{cases} E(t) & \text{if } t = t_1, \\ E(t) - \bigcup \{E(t_j, \pi) : 1 \le j < i\} & \text{if } t = t_i \text{ for } 1 < i \le n, \\ \emptyset & \text{otherwise.} \end{cases}$$

The weight of a $k$-term $t$ in $E$ with respect to $\pi$, denoted $w(t, \pi, E)$, is given
by the size of $E(t, \pi)$. We notice that if $t$ is not a member of the cover $\pi$ then
its weight is simply set to 0. Now, given a training set $E$, let $E_n$ be the set of
negative examples in $E$ and let $E_p$ be the set of positive examples in $E$. The
following lemma states that the distance between a concept and a training set
can be characterized in terms of weights.

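Lemma 1. For every $k$-DNF concept $\varphi$ and every set of examples $E$:

$$d(\varphi, E_n) = \sum_{t \in \varphi} w(t, E_n), \text{ and}$$

$$d(\varphi, E_p) = \min\{d(\varphi, \pi) : \pi \text{ is a cover of } E\} \quad \text{where} \quad d(\varphi, \pi) = \sum_{t \notin \varphi} w(t, \pi, E_p).$$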

Proof. The first property can be easily derived from the fact that, for every
negative example $e$, $d(\varphi, e)$ is the number of terms $t$ in $\varphi$ which are consistent
with $e$. Let us examine the second property. Let $E_p = E'_p \cup \{e\}$ and suppose
by induction hypothesis that $\pi'$ is a cover of $E'_p$ such that $d(\varphi, \pi')$ is minimal.
We know that $d(\varphi, E_p) = d(\varphi, E'_p) + d(\varphi, e)$. At this point, we remark that
$d(\varphi, e) = 0$ if $e$ is consistent with at least one term $t$ in $\varphi$ and 1 otherwise. First,
assume that $d(\varphi, e) = 0$ and let $t$ be a term in $\varphi$ which is consistent with $e$. The
cover $\pi$ is defined as follows: $\pi = \pi'$ if $\pi'$ covers $E_p$, and $\pi = \pi' \cup \{t\}$ otherwise.
In both cases we have $\sum_{t \notin \varphi} w(t, \pi, E_p) = \sum_{t \notin \varphi} w(t, \pi', E'_p)$. Since $d(\varphi, e) = 0$,
it follows that $d(\varphi, E_p) = d(\varphi, \pi)$, as desired. Second, assume that $d(\varphi, e) = 1$.
Let $t$ be an arbitrary $k$-term that is consistent with $e$. As previously, the cover
$\pi$ is defined by $\pi'$ if $\pi'$ covers $e$, and $\pi' \cup \{t\}$ otherwise. In both cases, we have
$\sum_{t \notin \varphi} w(t, \pi, E_p) = \sum_{t \notin \varphi} w(t, \pi', E'_p) + 1$. Since the right-hand side is the sum
of $d(\varphi, E'_p)$ and $d(\varphi, e)$, we obtain $d(\varphi, E_p) = d(\varphi, \pi)$, as desired.
Now, we turn to the notion of “minimal weighted cover”. Let $T_k$ be the set of
all $k$-terms generated from the boolean variables and let $\kappa$ be the cardinality of
$T_k$. Given a set of examples $E$ and a cover $\pi$ of $E$, the weight of $\pi$ in $E$, denoted
$w(\pi, E)$, is defined as follows:

$$w(\pi, E) = \sum_{i=1}^{\kappa} \min(w(t_i, E_n), w(t_i, \pi, E_p)).$$

A cover $\pi$ is called minimal if its weight is minimal, that is, for every other
cover $\pi'$ of $E$, we have $w(\pi, E) \le w(\pi', E)$. Informally, the weight of a minimal
cover corresponds to the optimal distance of the concepts in $\Delta_k(E)$. Furthermore,
a minimal cover embodies a whole “lattice” of most preferred concepts.
In particular, the maximally specific concept of $\pi$, denoted $S_\pi$, is the set of all
$k$-terms $t$ in $T_k$ such that $w(t, \pi, E_p) > w(t, E_n)$ and, dually, the maximally
general concept of $\pi$, denoted $G_\pi$, is the set of all $k$-terms $t$ in $T_k$ such that
$w(t, E_n) \le w(t, \pi, E_p)$. With these notions in hand, we are in a position to give
the representation theorem.

Theorem 1. For every $k$-DNF concept $\varphi$ and every set of examples $E$:
$\varphi \in \Delta_k(E)$ iff there exists a minimal cover $\pi$ of $E$ such that $S_\pi \subseteq \varphi \subseteq G_\pi$.

Proof. Let $\pi$ be a cover of $E$ such that $d(\varphi, \pi)$ is minimal and let $\mu(t, \pi)$ be an
abbreviation of $\min(w(t, E_n), w(t, \pi, E_p))$. Based on Lemma 1, we can derive that
$d(\varphi, E)$ is the sum of three parts:

$$d(\varphi, E) = \sum_{t \in \varphi} (w(t, E_n) - \mu(t, \pi)) + \sum_{t \notin \varphi} (w(t, \pi, E_p) - \mu(t, \pi)) + w(\pi, E).$$

Let $\varphi'$ be a concept such that $\varphi' \in \Delta_k(E)$. Based on the above result, we have
$d(\varphi', E) \ge w(\pi', E)$ where $\pi'$ is a cover of $E$ such that $d(\varphi', \pi')$ is minimal.
Dually, let $\varphi$ be a concept such that $S_\pi \subseteq \varphi \subseteq G_\pi$ for some minimal cover $\pi$ of
$E$. We may observe that $d(S_\pi, E) = d(G_\pi, E) = w(\pi, E)$ since, in both cases,
the first part and the second part of the above equation are set to 0. Thus,
$d(\varphi, E) \le w(\pi, E)$. Therefore, we have $w(\pi, E) \le w(\pi', E)$.

Now, suppose that $\varphi \notin \Delta_k(E)$. It follows that $d(\varphi, E) > d(\varphi', E)$. Therefore,
we derive $w(\pi, E) > w(\pi', E)$, but this contradicts the above result. On the other
hand, assume that $\varphi'$ is not factorized by $S_{\pi'}$ and $G_{\pi'}$. As $w(\pi, E) \le w(\pi', E)$,
there are two cases. First, if $w(\pi', E) > w(\pi, E)$, we then obtain $d(\varphi', E) >
d(\varphi, E)$. Therefore $\varphi' \notin \Delta_k(E)$, but this contradicts the initial assumption.
Second, if $w(\pi', E) = w(\pi, E)$, then $\pi'$ is a minimal cover of $E$. It follows that
$S_{\pi'} \not\subseteq \varphi'$ or $\varphi' \not\subseteq G_{\pi'}$. In both situations, it is easy to derive $d(\varphi', E) > w(\pi', E)$.
Thus, we obtain $d(\varphi', E) > d(\varphi, E)$. Therefore $\varphi' \notin \Delta_k(E)$, hence a contradiction.

Example 2. Let us examine the training set $E$ given in Example 1. Based on the
1-DNF concept class, we may generate eight covers of $E$. Notably, we observe
that $\pi = (v_1, \neg v_2)$ and $\pi' = (\neg v_2, v_1)$ are minimal covers of $E$. The weight of
these covers is 2. In the first case, $S_\pi = \{v_1\}$ and in the second case $S_{\pi'} = \{\neg v_2\}$.
Furthermore, we have $G_\pi = G_{\pi'} = \{v_1, \neg v_2\}$.
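
The quantities in Example 2 can be checked with a small sketch of the cover weight and of the two boundary concepts. It is not the authors' code: the string encoding of literals, the helper names, and the reading of the fourth training example as $(\{v_1, \neg v_2\}, 0)$ are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Term = FrozenSet[str]          # conjunction of literals, e.g. {"v1"}
Assignment = FrozenSet[str]    # literals that hold in an example


def extension(term: Term, examples: List[Assignment]) -> Set[int]:
    # E(t): indices of the examples consistent with t (all literals of t hold).
    return {i for i, e in enumerate(examples) if term <= e}


def cover_weights(cover: List[Term], positives: List[Assignment]) -> Dict[Term, int]:
    # w(t, pi, E_p): positives covered by t but by no higher-priority term of pi.
    seen: Set[int] = set()
    weights: Dict[Term, int] = {}
    for term in cover:
        fresh = extension(term, positives) - seen
        weights[term] = len(fresh)
        seen |= fresh
    return weights


def cover_weight(cover, all_terms, positives, negatives) -> int:
    # w(pi, E): sum over all k-terms of min(w(t, E_n), w(t, pi, E_p)).
    wp = cover_weights(cover, positives)
    return sum(min(len(extension(t, negatives)), wp.get(t, 0)) for t in all_terms)


def boundary_concepts(cover, all_terms, positives, negatives) -> Tuple[Set[Term], Set[Term]]:
    wp = cover_weights(cover, positives)
    specific = {t for t in all_terms if wp.get(t, 0) > len(extension(t, negatives))}
    general = {t for t in all_terms if len(extension(t, negatives)) <= wp.get(t, 0)}
    return specific, general


def lit(name: str) -> Term:
    return frozenset([name])


# Example 1 data (the negative example {v1, ~v2} is an assumed reading of e4).
POSITIVES = [frozenset({"v1", "v2"}), frozenset({"v1", "~v2"}), frozenset({"~v1", "~v2"})]
NEGATIVES = [frozenset({"v1", "~v2"}), frozenset({"~v1", "v2"})]
TERMS = [lit("v1"), lit("~v1"), lit("v2"), lit("~v2")]

if __name__ == "__main__":
    for cover in ([lit("v1"), lit("~v2")], [lit("~v2"), lit("v1")]):
        print(cover_weight(cover, TERMS, POSITIVES, NEGATIVES),
              boundary_concepts(cover, TERMS, POSITIVES, NEGATIVES))
```

With this encoding both covers have weight 2, the maximally specific concepts are $\{v_1\}$ and $\{\neg v_2\}$ respectively, and both share the maximally general concept $\{v_1, \neg v_2\}$, in agreement with the example.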
4 An Approximation Algorithm

As demonstrated in the previous section, the concept learning problem has close 
similarities with the so-called “weighted set cover” problem. This last problem is 
known to be NP-hard in the general case, yet efficient approximation algorithms 
have been proposed in the literature [16]. Based on these considerations, we 
develop in this section an approximation method that returns a cover which is 
as close as possible to the optimal distance. 

The algorithm is detailed in figure 1. The intuitive idea underlying the algo- 
rithm is to select terms in a “greedy” manner, by choosing at each iteration the 
term that covers the most positive examples and the least negative ones. 



Input: A training set $E$ and an integer $k \ge 1$.

Output: The most specific concept $S_\pi$ and the most general concept $G_\pi$
of a $k$-DNF cover $\pi$ of $E$.

1. Set $T = \{t : t \text{ is a } k\text{-term}\}$. Set $P = E_p$. Set $\pi = \emptyset$;

2. If $P = \emptyset$ then stop and output $S_\pi$ and $G_\pi$.

3. Find a term $t \in T$ that minimizes the quotient $w(t, E_n) / w(t, P)$, where
$w(t, P) \ne 0$. In case of a tie, take $t$ which maximizes $w(t, P)$;

4. Append $t$ at the end of $\pi$. Set $P = P - P(t)$. Return to step 2.



Fig. 1. MergeDnf($E$, $k$).
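
The greedy loop of Fig. 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the selection quotient is taken to be $w(t, E_n) / w(t, P)$ (the formula is damaged in the source and is inferred here from the condition $w(t, P) \ne 0$ and the surrounding description), remaining ties are broken by list order, and the function and variable names are hypothetical.

```python
from typing import FrozenSet, List, Set

Term = FrozenSet[str]          # conjunction of literals
Assignment = FrozenSet[str]    # literals that hold in an example


def extension(term: Term, examples: List[Assignment]) -> Set[int]:
    # Indices of the examples consistent with the term.
    return {i for i, e in enumerate(examples) if term <= e}


def merge_dnf_cover(terms: List[Term],
                    positives: List[Assignment],
                    negatives: List[Assignment]) -> List[Term]:
    """Greedy cover: repeatedly pick the term with the smallest ratio of
    covered negatives to newly covered positives (assumed quotient)."""
    remaining = set(range(len(positives)))   # P in Fig. 1
    cover: List[Term] = []                   # pi in Fig. 1
    while remaining:
        best_key, best_term = None, None
        for term in terms:
            gain = len(extension(term, positives) & remaining)   # w(t, P)
            if gain == 0:
                continue
            cost = len(extension(term, negatives))               # w(t, E_n)
            key = (cost / gain, -gain)        # ties: prefer larger w(t, P)
            if best_key is None or key < best_key:
                best_key, best_term = key, term
        if best_term is None:
            raise ValueError("some positive example is covered by no term")
        cover.append(best_term)
        remaining -= extension(best_term, positives)
    return cover


def lit(name: str) -> Term:
    return frozenset([name])


# Data of Example 1 (the negative example {v1, ~v2} is an assumed reading of e4).
POSITIVES = [frozenset({"v1", "v2"}), frozenset({"v1", "~v2"}), frozenset({"~v1", "~v2"})]
NEGATIVES = [frozenset({"v1", "~v2"}), frozenset({"~v1", "v2"})]
TERMS = [lit("v1"), lit("~v2"), lit("~v1"), lit("v2")]

if __name__ == "__main__":
    print(merge_dnf_cover(TERMS, POSITIVES, NEGATIVES))
```

On the data of Example 1, with the term order given above, the sketch returns the cover $(v_1, \neg v_2)$, one of the two minimal covers discussed in Example 2. A different term order can yield a different cover, since the source does not specify how remaining ties are resolved.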



An important feature of this algorithm is that it is bidirectional: it returns a
maximally specific concept and a maximally general concept, with respect to
the cover that has been found. Furthermore the algorithm has the additional
property that, while it does not always find a “minimal” cover, it tends to
approximate such a cover to within a logarithmic factor.

Theorem 2. For every $k$-DNF concept class and every training set $E$, if $m$ is
the number of positive examples and $w^*$ is the weight of a minimal cover, then
MergeDnf($E$, $k$) is guaranteed to find a cover of weight at most $(w^* + 1) \ln(m)$.

Proof. The demonstration is a variant of the proof given in [16]. In any iteration
$i$, let $e_i$ be a positive example that has not yet been covered by the algorithm.
Let $t$ be the first term to cover $e_i$. The cost of $e_i$, denoted $\mathrm{cost}(e_i)$, is given
by the quotient $w(t, E_n) / w(t, P)$, where $P$ is the set of remaining elements.
The size of this set is bounded by $m - i + 1$. In this case, the optimal solution
can cover the remaining elements of $P$ at a weight at most $w^* + 1$. Therefore,
we obtain $\mathrm{cost}(e_i) \le \frac{w^* + 1}{m - i + 1}$. It follows that the weight of the cover generated
by the algorithm is at most $\sum_{i=1}^{m} \mathrm{cost}(e_i) \le (w^* + 1) H_m$. Since
$H_m \approx \ln(m)$, we obtain $(w^* + 1) \ln(m)$, as desired.
As a corollary of this proposition, we may determine that the worst-case time
complexity of the algorithm is linear in the number of $k$-terms and polynomial
in the number of examples. Let $n$ be the number of boolean variables. For the sake
of simplicity, let us assume that $E$ contains $m$ positive examples and $m$ negative
examples. Step 1 of the algorithm requires only $O(mn^k)$ time. Moreover, the
number of iterations of the algorithm is bounded by $O((m + 1)\ln(m))$. This
corresponds to the worst case where the optimal weight is given by the number
of positive examples. Therefore, since step 3 requires $O(mn^k)$ time, the overall
time bound of the algorithm is $O(m^2 n^k \ln(m))$.
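
To make the dependence on the number of $k$-terms concrete, the following sketch enumerates them for a given set of variables. It assumes, as a working definition not restated in this section, that a $k$-term is a conjunction of between 1 and $k$ literals over distinct variables; the function name and the encoding of literals as (variable, sign) pairs are illustrative.

```python
from itertools import combinations, product
from typing import FrozenSet, List, Sequence, Tuple

Literal = Tuple[str, bool]     # (variable name, True for positive / False for negated)


def k_terms(variables: Sequence[str], k: int) -> List[FrozenSet[Literal]]:
    """Enumerate all conjunctions of 1..k literals over distinct variables."""
    terms: List[FrozenSet[Literal]] = []
    for size in range(1, k + 1):
        for chosen in combinations(variables, size):
            for signs in product((True, False), repeat=size):
                terms.append(frozenset(zip(chosen, signs)))
    return terms


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        names = [f"v{i}" for i in range(1, n + 1)]
        print(n, [len(k_terms(names, k)) for k in (1, 2, 3)])
```

Under this definition the count is $\sum_{j=1}^{k} \binom{n}{j} 2^j$, which grows as $O(n^k)$ for fixed $k$; this is the factor that appears in the time bound above.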

5 Experiments 

This section reports experimental validations of our learning scheme on a repre- 
sentative collection of datasets from the UCI Machine Learning Repository. 

Based on the bidirectional property of the algorithm, the learner can choose 
between two classifiers for classifying test data, namely, the maximally specific 
concept and the maximally general concept generated from the cover. From this 
viewpoint, each training set was split into a learning set used to induce concepts 
from data and a test set used to select the best classifier. In all the experiments, 
the fraction of the training set used as internal test data was set to 5%. Each 
experiment was then decomposed into three stages: 1) learn the two concepts 
from the learning set, 2) select the best concept on the internal test set, and 3) 
validate the resulting concept on the remaining, external test set. 

The twenty datasets are summarized in Table 1. For each benchmark prob- 
lem, the first section gives the number of examples, the number of continuous 
and discrete attributes, the number of classes, and the percentage of examples 
in the majority class. The datasets are taken without modification from the 
UCI repository with one exception: in the “waveform” problem, a 300-example 
dataset was generated, as suggested in [19]. The last two sections provide an 
empirical comparison of our learning scheme with the C4.5 decision-tree learner. 
To measure generalization error, we ran ten different 10-fold cross validations for 
each dataset and averaged the results. The second section details the accuracy 
results obtained by MergeDNF on 2-DNF formulas. Since the algorithm has 
been designed for two-class problems, the goal was to separate the most frequent 
class from the remaining classes. In case of a tie, the target class was selected
at random. Continuous data was discretized using the “equal-width binning
method” [20]. The number of bins $b$ was set to $b = \max(1, 2 \cdot \log(l))$, where $l$ is
the number of distinct observed values for each attribute.
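
The discretization step can be sketched as below. The text gives only the formula for the number of bins, so two details are assumptions made for illustration: the logarithm is taken to be the natural logarithm, and the result is rounded to the nearest integer. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def num_bins(distinct_values: int) -> int:
    # b = max(1, 2 * log(l)); natural log and rounding are assumptions.
    return max(1, int(round(2 * math.log(distinct_values))))


def equal_width_bins(values: Sequence[float]) -> List[int]:
    """Map each value to the index of its equal-width interval."""
    b = num_bins(len(set(values)))
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]          # constant attribute: a single bin
    width = (hi - lo) / b
    # The maximum value falls on the upper boundary, so clamp it into the last bin.
    return [min(int((v - lo) / width), b - 1) for v in values]


if __name__ == "__main__":
    print(num_bins(10))
    print(equal_width_bins(list(range(10))))
```

For an attribute with ten distinct values this yields five bins; each resulting bin index can then be treated as a value of a discrete attribute when building $k$-terms.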

Finally, the last column reports accuracy results obtained by the C4.5 algorithm.
We used C4.5 Release 8 that deals with noise by incorporating (by default) an 
error-based pruning technique and that handles continuous data using a method 
inspired by the Minimum Description Length principle. For all domains, the 
algorithm was run with the same default settings for all the parameters; no 
attempt was made to tune the system for these problems. Notice that the results 
are very similar to those reported by Quinlan in [19] . 
Table 1. Comparison of MergeDNF with C4.5

                    Attributes     Classes
Dataset     Size    cont   disc    nb   majority (%)   MergeDNF (k = 2)   C4.5 Release 8
breast-w     699       9      -     2      65.52         97.55 ± 1.05       94.76 ± 1.94
colic        368      10     12     2      60.00         85.07 ± 2.29       85.08 ± 2.85
credit-a     690       6      9     2      55.51         88.39 ± 2.46       85.16 ± 2.24
credit-g    1000       7     13     2      70.00         82.52 ± 2.48       71.40 ± 2.94
diabetes     768       8      -     2      65.10         76.03 ± 2.72       74.46 ± 2.87
glass        214       9      -     6      35.51         82.33 ± 5.49       67.12 ± 7.18
heart-c      303       8      5     2      54.12         84.00 ± 3.87       76.72 ± 4.91
heart-h      294       8      5     2      63.95         93.07 ± 4.19       80.17 ± 3.12
heart-s      123       8      5     2      93.50         95.50 ± 3.51       93.12 ± 4.52
heart-v      200       8      5     2      74.50         88.30 ± 4.50       72.74 ± 6.02
hepatitis    155       6     13     2      54.84         88.67 ± 5.13       80.81 ± 5.91
hypo        3772       7     22     5      95.23         97.22 ± 0.53       99.49 ± 0.08
iris         150       4      -     3      33.33         97.37 ± 2.63       95.02 ± 1.92
labor         57       8      8     2      64.91         88.60 ± 7.11       82.55 ± 9.73
sick        3772       7     22     2      90.74         95.85 ± 0.72       98.64 ± 0.33
sonar        208      60      -     2      53.37         86.55 ± 5.82       73.09 ± 6.57
splice      3190       -     62     3      51.88         97.39 ± 1.05       94.18 ± 0.83
vehicle      846      18      -     4      25.77         91.45 ± 1.40       72.13 ± 0.40
voting       435       -     16     2      61.38         97.40 ± 1.32       94.83 ± 2.72
waveform     300      21      -     3      33.92         89.03 ± 3.05       74.16 ± 2.10



Of course, the comparison of MergeDNF with C4.5 is biased since, notably,
the two learners actually use different techniques for handling continuous data.
Nevertheless, over these datasets, the inductive merging scheme is
competitive with pruning-based decision-tree learning. Specifically, the accuracy
results reveal that MergeDNF, for 2-DNF formulas, is approximately equal
or superior to C4.5 Release 8 on 18 of the 20 datasets. The performance of the
algorithm is particularly significant on noisy datasets like the “heart disease”
family. In all these domains, MergeDNF outperforms C4.5 even when pruning
was employed. Furthermore, we observe that the algorithm is very effective on
continuous datasets. The “glass”, “sonar” and “waveform” domains are particularly
notable examples. These benchmark problems are known to be difficult for
machine-learning algorithms, due to overlapping classes and numerical noise. In
these datasets, MergeDNF also outperforms C4.5.

From a computational point of view, we observed that in our learning scheme
the 2-DNF family offers an interesting compromise between the effectiveness
of the learner and the time spent to generate covers. For almost all domains,
the learning time was under 10 seconds, using a Pentium IV-1.5GHz. For
datasets containing a small number of attributes (e.g. “glass” or “iris”), the
learning time was even smaller than 1 second. The only notable exception is the
“splice” domain, which needed approximately 110 seconds.

In Table 2, we briefly illustrate how the performance of the learner depends
upon the choice of the parameter $k$. CPU times are given in seconds. Interestingly,
we remark that the accuracy of the learner does not necessarily increase
with $k$. On-going research investigates the use of model selection methods [21]
in order to choose the appropriate value of $k$ during the learning phase.

Table 2. Dependence on k for the MergeDNF algorithm

            1-DNF                   2-DNF                   3-DNF
Dataset     accuracy       cpu      accuracy       cpu      accuracy       cpu
glass       75.33 ± 6.02   0.007    82.33 ± 5.49   0.462    85.33 ± 4.60   2.740
heart-c     83.23 ± 3.62   0.013    84.00 ± 3.57   0.462    89.70 ± 3.13   5.950
heart-h     91.27 ± 5.21   0.021    93.07 ± 4.19   0.471    92.72 ± 4.57   5.615
heart-s     95.83 ± 3.15   0.006    95.50 ± 3.51   0.121    95.50 ± 3.25   1.235
heart-v     73.75 ± 5.19   0.012    88.30 ± 4.50   0.284    90.30 ± 3.82   3.175
iris        97.40 ± 3.30   0.002    97.37 ± 2.63   0.013    97.31 ± 2.44   0.041

6 Conclusion 

This study lies at the intersection of two research fields: concept learning and
belief merging. On the one hand, the aim of concept learning is to induce from
a set of examples a concept in a hypothesis language that is consistent with
the examples. On the other hand, the aim of belief merging is to infer from a
set of belief bases a new theory which is as consistent as possible with the initial
beliefs. The main insight underlying this study has been to base induction on a
belief merging operator that selects the concepts which are as close as possible
to the training examples, using an appropriate distance measure.

Several directions of future research are possible. In this paper, we have
restricted the paradigm of inductive merging to $k$-DNF concepts. An important
issue is to extend this paradigm to other concept classes in both the propositional
setting and the first-order setting. A first question here is whether an
appropriate distance measure can be defined on these concept classes. A second
question is whether an algorithm can be designed for generating concepts
of minimal distance or, at least, approximating this optimum to within a small
factor. Some classes, like $k$-CNF, are quite immediate. However, solving these
questions for other concept classes, like Horn theories or first-order clausal theories,
is more demanding. Another line of research is to generalize further the idea
of inductive merging. To this end, a wide variety of aggregation operators have
been proposed in the belief merging literature. Some authors use a “weighted-sum”
for capturing the level of confidence of belief theories [11]. Other authors
advocate “max” functions in order to satisfy some principles of arbitration [12].
At this point, it would be interesting to examine these operators in the setting of
robust concept learning. For example, a “weighted-sum” would be particularly
relevant for training examples that do not have the same level of confidence.
Abstract. Tree induction methods and linear models are popular techniques
for supervised learning tasks, both for the prediction of nominal
classes and continuous numeric values. For predicting numeric quanti- 
ties, there has been work on combining these two schemes into ‘model 
trees’, i.e. trees that contain linear regression functions at the leaves. In 
this paper, we present an algorithm that adapts this idea for classifica- 
tion problems, using logistic regression instead of linear regression. We 
use a stagewise fitting process to construct the logistic regression models 
that can select relevant attributes in the data in a natural way, and show 
how this approach can be used to build the logistic regression models at 
the leaves by incrementally refining those constructed at higher levels in 
the tree. We compare the performance of our algorithm against that of 
decision trees and logistic regression on 32 benchmark UCI datasets, and 
show that it achieves a higher classification accuracy on average than the 
other two methods. 



1 Introduction 

Two popular methods for classification are linear logistic regression and tree 
induction, which have somewhat complementary advantages and disadvantages. 
The former fits a simple (linear) model to the data, and the process of model 
fitting is quite stable, resulting in low variance but potentially high bias. The 
latter, on the other hand, exhibits low bias but often high variance: it searches a 
less restricted space of models, allowing it to capture nonlinear patterns in the 
data, but making it less stable and prone to overfitting. So it is not surprising 
that neither of the two methods is superior in general — earlier studies [10] have 
shown that their relative performance depends on the size and the characteristics 
of the dataset (e.g., the signal-to-noise ratio). 

It is a natural idea to try and combine these two methods into learners that 
rely on simple regression models if only little and/or noisy data is available 
and add a more complex tree structure if there is enough data to warrant such 
structure. For the case of predicting a numeric variable, this has led to ‘model
trees’, which are decision trees with linear regression models at the leaves. These
have been shown to produce good results [11]. Although it is possible to use 
model trees for classification tasks by transforming the classification problem 
into a regression task by binarizing the class [4] , this approach produces several 
trees (one per class) and thus makes the final model harder to interpret. 

A more natural way to deal with classification tasks is to use a combination of 
a tree structure and logistic regression models resulting in a single tree. Another 
advantage of using logistic regression is that explicit class probability estimates 
are produced rather than just a classification. In this paper, we present a method 
that follows this idea. We discuss a new scheme for selecting the attributes to be 
included in the logistic regression models, and introduce a way of building the 
logistic models at the leaves by refining logistic models that have been trained 
at higher levels in the tree, i.e. on larger subsets of the training data. 

We compare the performance of our method against the decision tree learner 
C4.5 [12] and logistic regression on 32 UCI datasets [1], looking at classification 
accuracy and size of the constructed trees. We also include results for two learn- 
ing schemes that build multiple trees, namely boosted decision trees and model 
trees fit to the class indicator variables [4] , and a different algorithm for building 
logistic model trees called PLUS [8]. From the results of the experiments we 
conclude that our method achieves a higher average accuracy than C4.5, model 
trees, logistic regression and PLUS, and is competitive with boosted trees. We 
will also show that it smoothly adapts the tree size to the complexity of the data 
set. 

The rest of the paper is organized as follows. In Section 2 we briefly review 
logistic regression and the model tree algorithm and introduce logistic model 
trees in more detail. Section 3 describes our experimental study, followed by 
a discussion of results. We discuss related work in Section 4 and draw some 
conclusions in Section 5. 



2 Algorithms 

This section begins with a brief introduction to the application of regression for 
classification tasks and a description of our implementation of logistic regression. 
A summary of model tree induction is also provided as this is a good starting 
point for understanding our method. 

2.1 Logistic Regression 

Linear regression performs a least-squares fit of a parameter vector $\beta$ to a
numeric target variable to form a model

$$f(x) = \beta^T \cdot x,$$

where $x$ is the input vector (we assume a constant term in the input vector to
accommodate the intercept). It is possible to use this technique for classification
by directly fitting linear regression models to class indicator variables. If there
are $J$ classes then $J$ indicator variables are created and the indicator for class $j$
takes on value 1 whenever class $j$ is present and value 0 otherwise. However, this
approach is known to suffer from masking problems in the multiclass setting [7].

1. Start with weights $w_{ij} = 1/n$, $i = 1, \ldots, n$, $j = 1, \ldots, J$, $F_j(x) = 0$
   and $p_j(x) = 1/J \;\; \forall j$

2. Repeat for $m = 1, \ldots, M$:

   (a) Repeat for $j = 1, \ldots, J$:

       i. Compute working responses and weights in the $j$th class

          $$z_{ij} = \frac{y^*_{ij} - p_j(x_i)}{p_j(x_i)(1 - p_j(x_i))}$$

          $$w_{ij} = p_j(x_i)(1 - p_j(x_i))$$

       ii. Fit the function $f_{mj}(x)$ by a weighted least-squares regression
           of $z_{ij}$ to $x_i$ with weights $w_{ij}$

   (b) Set $f_{mj}(x) \leftarrow \frac{J-1}{J}\left(f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x)\right)$, $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$

   (c) Update $p_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}$

3. Output the classifier $\arg\max_j F_j(x)$



Fig. 1. LogitBoost algorithm for J classes.

A better method for classification is linear logistic regression, which models
the posterior class probabilities $\Pr(G = j \mid X = x)$ for the $J$ classes via linear
functions in $x$ while at the same time ensuring they sum to one and remain in
[0,1]. The model has the form¹

$$\Pr(G = j \mid X = x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0,$$

where $F_j(x) = \beta_j^T \cdot x$ are linear regression functions, and it is usually fit by
finding maximum likelihood estimates for the parameters $\beta_j$.

One way to find these estimates is based on the LogitBoost algorithm [6].
LogitBoost performs forward stage-wise fitting of additive logistic regression models,
which generalize the above model to $F_j(x) = \sum_m f_{mj}(x)$, where the $f_{mj}$ can
be arbitrary functions of the input variables that are fit by least squares regression.
In our application we are interested in linear models, and LogitBoost finds
the maximum likelihood linear logistic model if the $f_{mj}$ are fit using (simple or
multiple) linear least squares regression and the algorithm is run until convergence.
This is because the likelihood function is convex and LogitBoost performs
quasi-Newton steps to find its maximum.

The algorithm (shown in Figure 1) iteratively fits regression functions $f_{mj}$
to a ‘response variable’ (reweighted residuals). The $x_1, \ldots, x_n$ are the training
examples and the $y^*_{ij}$ encode the observed class membership probabilities for
instance $x_i$, i.e. $y^*_{ij}$ is one if $x_i$ is labeled with class $j$ and zero otherwise. One
can build the $f_{mj}$ by performing multiple regression based on all attributes
present in the data, but it is also possible to use a simple linear regression,
selecting the attribute that gives the smallest squared error. If the algorithm is
run until convergence this will give the same final model because every multiple
linear regression function can be expressed as a sum of simple linear regression
functions, but using simple regression will slow down the learning process and
thus give a better control over model complexity. This allows us to obtain simple
models and prevent overfitting of the training data: the model learned after a
few iterations (for a small $M$) will only include the most relevant attributes
present in the data, resulting in automatic attribute selection, and, if we use
cross-validation to determine the best number of iterations, a new variable will
only be added if this improves the performance of the model on unseen cases.

¹ This is the symmetric formulation [6].

Niels Landwehr, Mark Hall, and Eibe Frank

In our empirical evaluation simple regression together with (five fold) cross- 
validation indeed outperformed multiple regression. Consequently, we chose this 
approach for our implementation of logistic regression. We will refer to it as the 
SimpleLogistic algorithm. 
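
The LogitBoost variant with simple regression described in this section can be sketched in plain Python. This is an illustrative sketch rather than the SimpleLogistic implementation: it runs a fixed number of iterations instead of choosing it by cross-validation, it handles numeric attributes only, and the clipping of weights and working responses (the `1e-10` floor and the `[-4, 4]` range) is a numerical-stability assumption that is not taken from the text. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[int, float, float]     # (attribute index, intercept, slope)


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_simple_regression(X, z, w) -> Model:
    """Weighted least-squares fit of z on each single attribute; keep the best one."""
    sw = sum(w)
    mz = sum(wi * zi for wi, zi in zip(w, z)) / sw
    best = None
    for a in range(len(X[0])):
        col = [row[a] for row in X]
        mx = sum(wi * xi for wi, xi in zip(w, col)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, col))
        sxz = sum(wi * (xi - mx) * (zi - mz) for wi, xi, zi in zip(w, col, z))
        slope = sxz / sxx if sxx > 1e-12 else 0.0
        intercept = mz - slope * mx
        err = sum(wi * (zi - intercept - slope * xi) ** 2
                  for wi, xi, zi in zip(w, col, z))
        if best is None or err < best[0]:
            best = (err, a, intercept, slope)
    return best[1:]


def centered(f: Sequence[float]) -> List[float]:
    # Step 2(b) of Fig. 1: f_j <- (J-1)/J * (f_j - mean_k f_k).
    J = len(f)
    mean = sum(f) / J
    return [(J - 1) / J * (fj - mean) for fj in f]


def logitboost(X, y, J: int, M: int) -> List[List[Model]]:
    F = [[0.0] * J for _ in X]                      # committee outputs F_j(x_i)
    stages: List[List[Model]] = []
    for _ in range(M):
        P = [softmax(Fi) for Fi in F]               # step 2(c) from the previous round
        stage = []
        for j in range(J):
            w = [max(p[j] * (1.0 - p[j]), 1e-10) for p in P]
            z = [max(-4.0, min(4.0, ((1.0 if yi == j else 0.0) - p[j]) / wi))
                 for yi, p, wi in zip(y, P, w)]
            stage.append(fit_simple_regression(X, z, w))
        stages.append(stage)
        for i, row in enumerate(X):
            f = centered([b0 + b1 * row[a] for a, b0, b1 in stage])
            F[i] = [Fj + fj for Fj, fj in zip(F[i], f)]
    return stages


def predict(stages: List[List[Model]], row: Sequence[float]) -> int:
    J = len(stages[0])
    F = [0.0] * J
    for stage in stages:
        f = centered([b0 + b1 * row[a] for a, b0, b1 in stage])
        F = [Fj + fj for Fj, fj in zip(F, f)]
    return max(range(J), key=F.__getitem__)


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0], [8.0], [9.0], [10.0], [11.0]]
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    model = logitboost(X, y, J=2, M=10)
    print(predict(model, [1.0]), predict(model, [10.0]))
```

Because each stage adds one single-attribute regression per class, stopping after a few iterations leaves only the attributes that were selected so far in the model, which is the attribute-selection effect described above.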

2.2 Model Trees 

Model trees, like ordinary regression trees, predict a numeric value given an 
instance that is defined over a fixed set of numeric or nominal attributes. Unlike 
ordinary regression trees, model trees construct a piecewise linear (instead of a 
piecewise constant) approximation to the target function. The final model tree 
consists of a decision tree with linear regression models at the leaves, and the 
prediction for an instance is obtained by sorting it down to a leaf and using the 
prediction of the linear model associated with that leaf. 

The M5’ model tree algorithm [13] — a ‘rational reconstruction’ of Quinlan’s 
M5 algorithm [11] — first constructs a regression tree by recursively splitting the 
instance space using tests on single attributes that maximally reduce variance in 
the target variable. After the tree has been grown, a linear multiple regression 
model is built for every inner node, using the data associated with that node 
and all the attributes that participate in tests in the subtree rooted at that 
node. Then the linear regression models are simplified by dropping attributes 
if this results in a lower expected error on future data (more specifically, if 
the decrease in the number of parameters outweighs the increase in the observed 
training error). After this has been done, every subtree is considered for pruning. 
Pruning occurs if the estimated error for the linear model at the root of a subtree
is smaller than or equal to the expected error for the subtree. After pruning has
terminated, M5’ applies a ‘smoothing’ process that combines the model at a leaf
with the models on the path to the root to form the final model that is placed 
at the leaf. 

2.3 Logistic Model Trees 

Given this model tree algorithm, it appears quite straightforward to build a 
‘logistic model tree’ by growing a standard classification tree, building logistic 
regression models for all nodes, pruning some of the subtrees using a pruning
criterion, and combining the logistic models along a path into a single model in 
some fashion. However, the devil is in the details and M5’ uses a set of heuristics 
at crucial points in the algorithm — heuristics that cannot easily be transferred 
to the classification setting. 

Fortunately LogitBoost enables us to view the combination of tree induction 
and logistic regression from a different perspective: iterative fitting of simple lin- 
ear regression interleaved with splits on the data. Recall that LogitBoost builds 
a logistic model by iterative refinement, successively including more and more 
variables as new linear models fmj are added to the committee Fj . The idea is 
to recursively split the iterative fitting process into branches corresponding to 
subsets of the data, a process that automatically generates a tree structure. 

As an example, consider a tree with a single split at the root and two leaves. 
The root node $N$ has training data $T$ and one of its sons $N'$ has a subset of the
training data $T' \subset T$. Following the classical approach, there would be a logistic
regression model M at node N trained on T and a logistic regression model M' 
at N' trained on T' . For classification, the class probability estimates of M and 
M' would be averaged to form the final model for N' . 

In our approach, the tree would instead be constructed by building a logistic
model $M$ at $N$ by fitting linear regression models trained on $T$ as long as this
improves the fit to the data, and then building the logistic model M' at N' 
by taking M and adding more linear regression models that are trained on T', 
rather than starting from scratch. As a result, the final model at a leaf consists of 
a committee of linear regression models that have been trained on increasingly 
smaller subsets of the data (while going down the tree). Building the logistic 
regression models in this fashion by refining models built at higher levels in the 
tree is computationally more efficient than building them from scratch. 

However, a practical tree inducer also requires a pruning method. In our 
experiments ‘local’ pruning criteria employed by algorithms like C4.5 and M5’ 
did not lead to reliable pruning. Instead, we followed the pruning scheme em- 
ployed by the CART algorithm [2], which uses cross-validation to obtain more 
stable pruning results. Although this increased the computational complexity, it 
resulted in smaller and generally more accurate trees. 

These ideas lead to the following algorithm for constructing logistic model 
trees: 

— Tree growing starts by building a logistic model at the root using the
LogitBoost algorithm. The number of iterations (and simple regression functions
$f_{mj}$ to add to $F_j$) is determined using five fold cross-validation. In this
process the data is split into training and test set five times. For every training
set, LogitBoost is run to a maximum number of iterations (we used 200), and
the error rates on the test set are logged for every iteration and summed up
over the different folds. The number of iterations that has the lowest sum of
errors is used to train the LogitBoost algorithm on all the data. This gives
the logistic regression model at the root of the tree.

LMT(examples) { 
    root = new Node() 
    alpha = getCARTAlpha(examples) 
    root.buildTree(examples, null) 
    root.CARTprune(alpha) 
} 

buildTree(examples, initialLinearModels) { 
    numIterations = crossValidateIterations(examples, initialLinearModels) 
    initLogitBoost(initialLinearModels) 
    linearModels = copyOf(initialLinearModels) 
    for i = 1...numIterations 
        logitBoostIteration(linearModels, examples) 
    split = findSplit(examples) 
    localExamples = split.splitExamples(examples) 
    sons = new Nodes[split.numSubsets()] 
    for s = 1...sons.length 
        // each son resumes from the models fitted at this node 
        sons[s].buildTree(localExamples[s], linearModels) 
} 

crossValidateIterations(examples, initialLinearModels) { 
    for fold = 1...5 
        initLogitBoost(initialLinearModels) 
        // split into training/test set 
        train = trainCV(fold) 
        test = testCV(fold) 
        linearModels = copyOf(initialLinearModels) 
        for i = 1...200 
            logitBoostIteration(linearModels, train) 
            logErrors[i] += error(test) 
    numIterations = findBestIteration(logErrors) 
    return numIterations 
} 



Fig. 2. Pseudocode for the LMT algorithm. 



— A split for the data at the root is constructed using the C4.5 splitting cri- 
terion [12]. Both binary splits on numerical attributes and multiway splits 
on nominal attributes are considered. Tree growing continues by sorting the 
appropriate subsets of data to those nodes and building the logistic models 
of the child nodes in the following way: the LogitBoost algorithm is run on 
the subset associated with the child node, but starting with the committee 
Fj(x), weights wij and probability estimates pij of the last iteration 
performed at the parent node (it is ‘resumed’ at step 2.a in Figure 1). Again, 
the optimum number of iterations to perform (the number of fmj to add to 
Fj) is determined by five-fold cross-validation. 

— Splitting continues in this fashion as long as more than 15 instances are at 
a node and a useful split can be found by the C4.5 splitting routine. 

— The tree is pruned using the CART pruning algorithm as outlined in [2]. 

Figure 2 gives the pseudocode for this algorithm, which we call LMT. The 
method LMT constructs the tree given the training data examples. It first calls 
getCARTAlpha to cross-validate the ‘cost-complexity parameter’ for the CART 
pruning scheme implemented in CARTprune. The method buildTree grows the 
logistic model tree by recursively splitting the instance space. The argument 
initialLinearModels contains the simple linear regression functions already 
fit by LogitBoost at higher levels of the tree. The method initLogitBoost 
initializes the probabilities/weights for the LogitBoost algorithm as if it had 
already fitted the regression functions initialLinearModels (resuming 
LogitBoost at step 2.a). The method crossValidateIterations determines the 
number of LogitBoost iterations to perform, and logitBoostIteration performs a 
single iteration of the LogitBoost algorithm (step 2), updating the 
probabilities/weights and adding a regression function to linearModels. 
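The selection step inside crossValidateIterations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_iteration` and the input layout (one list of test errors per fold, indexed by iteration) are hypothetical, and the error curves would in practice come from running LogitBoost on each training fold.

```python
from typing import List, Sequence, Tuple


def best_iteration(error_curves: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Pick the iteration count with the lowest error summed over all folds.

    error_curves[f][i] is the test error of fold f after i + 1 boosting
    iterations (hypothetical layout, chosen for this sketch).
    Returns the 1-based iteration count and the summed error curve.
    """
    n_iter = min(len(curve) for curve in error_curves)
    summed = [sum(curve[i] for curve in error_curves) for i in range(n_iter)]
    # min() returns the first minimum, i.e. the smallest such iteration count.
    best = min(range(n_iter), key=summed.__getitem__)
    return best + 1, summed


# Three folds, four iterations each (made-up numbers for illustration only).
curves = [[5, 3, 2, 4], [6, 4, 3, 3], [4, 2, 3, 5]]
num_iterations, summed_errors = best_iteration(curves)
```

Summing over folds before taking the minimum means a single noisy fold cannot decide the iteration count on its own.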

Handling of Missing Values and Nominal Attributes. To deal with missing 
values, we calculate the mean (for numeric attributes) or mode (for categorical 
ones) of each attribute on all the training data and substitute it for the 
missing values. The same means and modes are used to fill in missing values 
when classifying new instances. 
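A minimal sketch of this imputation scheme, assuming instances are stored as dictionaries and a missing value is represented by `None` (both are choices made for this example, not taken from the paper or from Weka):

```python
from statistics import mean, mode


def fit_fill_values(rows, numeric_attrs):
    """Compute one replacement value per attribute from the training data:
    the mean for numeric attributes, the mode for nominal ones.
    Assumes every attribute has at least one observed value."""
    fills = {}
    for attr in rows[0]:
        present = [r[attr] for r in rows if r[attr] is not None]
        fills[attr] = mean(present) if attr in numeric_attrs else mode(present)
    return fills


def impute(row, fills):
    """Replace missing entries of one instance with the stored training values."""
    return {attr: (fills[attr] if value is None else value)
            for attr, value in row.items()}


train = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "red"},
    {"age": 50, "color": None},
    {"age": 40, "color": "blue"},
]
fills = fit_fill_values(train, numeric_attrs={"age"})
# The same fills are reused for a new instance at classification time.
new_instance = impute({"age": None, "color": None}, fills)
```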

When considering splits in the tree, multi-valued nominal attributes are han- 
dled in the usual way. However, regression functions can only be fit to numeric 
attributes. Therefore, they are fit to local copies of the training data where 
nominal attributes with k values have been converted into k binary indicator 
attributes. 
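The conversion of a nominal attribute with k values into k binary indicator attributes could look as follows. The function name, the dictionary representation and the `attr=value` naming of the new attributes are assumptions made for this sketch.

```python
def to_indicators(rows, nominal_values):
    """Return copies of the instances in which each nominal attribute with
    k values is replaced by k binary (0/1) indicator attributes.

    nominal_values maps an attribute name to its list of possible values,
    fixed in advance from the training data (hypothetical layout).
    """
    converted = []
    for row in rows:
        new_row = {}
        for attr, value in row.items():
            if attr in nominal_values:
                for v in nominal_values[attr]:
                    new_row[f"{attr}={v}"] = 1.0 if value == v else 0.0
            else:
                new_row[attr] = value  # numeric attributes are kept unchanged
        converted.append(new_row)
    return converted


local_copy = to_indicators(
    [{"x": 1.5, "color": "red"}],
    {"color": ["red", "green", "blue"]},
)
```

Because the function builds new dictionaries, the original instances stay available for the splitting routine, which still treats nominal attributes in the usual way.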

Computational Complexity. The asymptotic complexity for building a logistic 
regression model is $O(n \cdot v^2 \cdot c)$ if we assume that the number of 
LogitBoost iterations is linear in the number of attributes present in the 
data¹ ($n$ denotes the number of training examples, $v$ the number of 
attributes, and $c$ the number of classes). The complexity of building a 
logistic model tree is $O(n \cdot v^2 \cdot d \cdot c + k^2)$, where $d$ is the 
depth and $k$ the number of nodes of the initial unpruned tree. The first part 
of the sum derives from building the logistic regression models, the second one 
from the CART pruning scheme. In our experiments, the time for building the 
logistic regression models accounted for most of the overall runtime. Compared 
to simple tree induction, the asymptotic complexity of LMT is only worse by a 
factor of $v$. However, the nested cross-validations (one to prune the tree, 
one to determine the optimum number of LogitBoost iterations) constitute a 
large (albeit constant) multiplying factor. 

In the algorithm outlined above, the optimum number of iterations is deter- 
mined by a five fold cross-validation for every node. This is the most compu- 
tationally expensive part of the algorithm. We use two heuristics to reduce the 
runtime: 

— In order to avoid an internal cross-validation at every node, we determine 
the optimum number of iterations by performing one cross-validation at the 
beginning of the algorithm and then using that number everywhere in the 
tree. This approach works surprisingly well: it never produced results that 
were significantly worse than those of the original algorithm. This indicates 
that the best number of iterations for LogitBoost does depend on the dataset 
(just choosing a fixed number of iterations for all of the datasets leads 
to significantly worse results), but not so much on different subsets of a 
particular dataset (as encountered in lower levels of the tree). 

¹ Note that in our implementation it is actually bounded by a constant (500 for 
standalone logistic regression and 200 at the nodes of the logistic model tree). 
— When performing the initial cross-validation, we have to select the number of 
iterations that gives the lowest error on the test set. Typically, the error will 
first decrease and later increase again because the model overfits the data. 
This allows the number of iterations to be chosen greedily by monitoring the 
error while performing iterations and stopping if the error starts to increase 
again. Because the error curve exhibits some spikes and irregularities, we 
keep track of the current minimum and stop if it has not changed for 25 
iterations. Using this heuristic does not change the behavior of the algorithm 
significantly. 

We included both heuristics in the final version of our algorithm, and all results 
shown for LMT refer to this final version. 
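The second heuristic amounts to early stopping with a patience of 25 iterations. A minimal sketch, assuming the per-iteration errors are supplied as an iterable (for example a generator that runs one boosting iteration per item); the function name and interface are hypothetical:

```python
def greedy_stop(errors, patience=25):
    """Return the 1-based iteration with the lowest error seen before stopping.

    Iteration stops as soon as the current minimum has not improved for
    `patience` consecutive iterations, so later items of `errors` are
    never consumed.
    """
    best_err = float("inf")
    best_iter = 0
    for i, err in enumerate(errors, start=1):
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            break
    return best_iter


# A spiky curve: the minimum at iteration 4 is followed by 30 worse values,
# so the much lower error at iteration 35 is never reached.
curve = [10, 8, 9, 7] + [9] * 30 + [1]
chosen = greedy_stop(curve)
```

The example also shows the price of the heuristic: a late improvement beyond the patience window is missed, which the paper reports does not change the behavior of the algorithm significantly.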

3 Experiments 

In order to evaluate the performance of our method and compare it against other 
state-of-the-art learning schemes, we applied it to several real-world problems. 
More specifically, we seek to answer the following questions in our experimental 
study: 

1. How does LMT compare to the two algorithms that form its basis, i.e., logis- 
tic regression and C4.5? Ideally, we would never expect worse performance 
than either of these algorithms. 

2. How does LMT compare to methods that build multiple trees? We include 
results for boosted C4.5 trees (using the AdaBoost.M1 algorithm [5] and 100 
boosting iterations), where the final model is a ‘voting committee’ of trees, 
and for the M5’ algorithm, which builds one tree per class when applied to 
classification problems. 

3. How big are the trees constructed by LMT? We expect them to be much 
smaller than simple classification trees because the leaves contain more in- 
formation. We also expect the trees to be pruned back to the root if a linear 
logistic model is the best solution for the dataset. 

We will also give results for another recently developed algorithm for inducing 
logistic model trees called ‘PLUS’ (see Section 4 for a short discussion of the 
PLUS algorithm). 

3.1 Datasets and Methodology 

For our experiments we used 32 benchmark datasets from the UCI repository [1], 
given in the first column of Table 1. Their size ranges from under one hundred to a 
few thousand instances. They contain varying numbers of numeric and nominal 
attributes and some contain missing values. For more information about the 
datasets, see for example [4]. 

Table 1. Average classification accuracy and standard deviation. 

Data Set        LMT         C4.5          SimpleLogistic  M5’           PLUS              AdaBoost.M1
anneal          99.5±0.8    98.6±1.0 •    99.5±0.8        98.6±1.1      99.4±0.8 (c)      99.6±0.7
audiology       84.0±7.8    77.3±7.5 •    83.7±7.8        76.8±8.6 •    80.6±8.3 (c)      84.7±7.6
australian      85.0±4.1    85.6±4.0      85.2±4.1        85.4±3.9      85.2±3.9 (m)      86.4±4.0
autos           75.8±9.7    81.8±8.8      75.1±8.9        76.0±10.0     76.6±8.7 (c)      86.8±6.8 o
balance-scale   90.0±2.5    77.8±3.4 •    88.6±3.0        87.8±2.2 •    89.7±2.8 (m)      76.1±4.1 •
breast-cancer   75.6±5.4    74.3±6.1      75.6±5.5        70.4±6.8 •    71.5±5.7 (c) •    66.2±8.1 •
breast-w        96.3±2.1    95.0±2.7      96.2±2.3        95.9±2.2      96.4±2.2 (c)      96.7±2.2
german          75.3±3.7    71.3±3.2 •    75.2±3.7        75.0±3.3      73.3±3.5 (m)      74.5±3.3
glass           69.7±9.5    67.6±9.3      65.4±8.7        71.3±9.1      69.3±9.7 (c)      78.8±7.8 o
glass (G2)      76.5±8.9    78.2±8.5      76.9±8.8        81.1±8.7      83.2±11.1 (c)     88.7±6.4 o
heart-c         82.7±7.4    76.9±6.6 •    83.1±7.4        82.1±6.7      78.2±7.4 (s)      80.0±6.5
heart-h         84.2±6.3    80.2±8.0      84.2±6.3        82.4±6.4      79.8±7.8 (c)      78.3±7.1 •
heart-statlog   83.6±6.6    78.1±7.4 •    83.7±6.5        82.1±6.8      83.7±6.4 (m)      80.4±7.1
hepatitis       83.7±8.1    79.2±9.6      84.1±8.1        82.4±8.8      83.3±7.8 (m)      84.9±7.8
horse-colic     83.7±6.3    85.2±5.9      82.2±6.0        83.2±5.4      84.0±5.8 (c)      81.7±5.8
hypothyroid     99.6±0.4    99.5±0.4      96.8±0.7 •      99.4±0.4      99.1±0.4 (c) •    99.7±0.3
ionosphere      92.7±4.3    89.7±4.4 •    88.1±5.3 •      89.9±4.2      89.5±5.2 (c)      94.0±3.8
iris            96.2±5.0    94.7±5.3      96.3±4.9        94.9±5.6      94.3±5.4 (c)      94.5±5.0
kr-vs-kp        99.7±0.3    99.4±0.4      97.4±0.8 •      99.2±0.5 •    99.5±0.4 (c)      99.6±0.3
labor           91.5±10.9   78.6±16.6 •   91.9±10.4       85.1±16.3     89.9±11.5 (c)     88.9±14.1
lymphography    84.7±9.6    75.8±11.0 •   84.5±9.3        80.4±9.3      78.4±10.2 (c)     84.7±8.4
pima-indians    77.1±4.4    74.5±5.3      77.1±4.5        76.6±4.7      77.2±4.3 (m)      73.9±4.8 •
primary-tumor   46.7±6.2    41.4±6.9 •    46.7±6.2        45.3±6.2      40.7±6.1 (c) •    41.7±6.5 •
segment         97.1±1.2    96.8±1.3      95.4±1.5 •      97.4±1.0      96.8±1.1 (c)      98.6±0.7 o
sick            98.9±0.6    98.7±0.6      96.7±0.7 •      98.4±0.6 •    98.6±0.6 (c)      99.0±0.5
sonar           76.4±9.4    73.6±9.3      75.1±8.9        78.4±8.8      71.6±8.0 (c)      85.1±7.8 o
soybean         93.6±2.5    91.8±3.2      93.5±2.7        92.9±2.6      93.6±2.7 (c)      93.3±2.8
vehicle         82.4±3.3    72.3±4.3 •    80.4±3.4        78.7±4.4 •    79.8±4.0 (m)      77.9±3.6 •
vote            95.7±2.8    96.6±2.6      95.7±2.7        95.6±2.8      95.3±2.8 (c)      95.2±3.3
vowel           94.1±2.5    80.2±4.4 •    84.2±3.7 •      80.9±4.7 •    83.0±3.7 (c) •    96.8±1.9 o
waveform-noise  87.0±1.6    75.3±1.9 •    86.9±1.6        82.5±1.6 •    86.7±1.5 (m)      85.0±1.6 •
zoo             95.0±6.6    92.6±7.3      94.8±6.7        94.5±6.4      94.5±6.8 (c)      96.3±6.1

o, • statistically significant win or loss 

For every dataset and algorithm we performed ten runs of ten-fold stratified 
cross-validation (using the same splits into training/test set for every method). 
This gives a hundred data points for each algorithm and dataset, from which we 
calculated the average accuracy (percentage of correctly classified instances) and 
standard deviation. To correct for the dependencies in the estimates we used the 
corrected resampled t-test [9] instead of the standard t-test on these data points 
to identify significant wins/losses of our method against the other methods at a 
5% significance level. 
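The test statistic of the corrected resampled t-test can be sketched as follows. The formula is the form in which this test is commonly stated (the variance of the paired differences is inflated by the ratio of test to training set size); it is given here as an assumption, since the paper itself does not spell it out, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, test_train_ratio):
    """t statistic for paired accuracy differences from repeated resampling.

    diffs            -- one accuracy difference per train/test split
                        (100 values for ten runs of ten-fold cross-validation)
    test_train_ratio -- n_test / n_train, i.e. 1/9 for ten-fold cross-validation

    The standard t-test would use 1/k alone; the extra term corrects for the
    overlap between training sets. Compare |t| with the critical value of a
    t distribution with k - 1 degrees of freedom.
    """
    k = len(diffs)
    var = variance(diffs)  # sample variance with k - 1 in the denominator
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * var)


# Tiny made-up example: three differences, ten-fold ratio 1/9.
t_value = corrected_resampled_t([1.0, 2.0, 3.0], 1.0 / 9.0)
```

With the correction term the denominator grows, so fewer differences are declared significant than with the standard t-test on the same data points.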

Table 1 gives the average classification accuracy for every method and dataset, 
and indicates significant wins/losses compared to LMT. Table 2 gives the 
number of datasets on which a method (column) significantly outperforms another 
method (row). Apart from PLUS, all algorithms are implemented in Weka 3.3.6 
(including LMT and SimpleLogistic)². Note that PLUS has three different modes 
of operation: one to build a simple classification tree, and two modes that build 
logistic model trees using simple/multiple logistic regression models. For all 
datasets, we ran PLUS in all three modes and selected the best result, 
indicated by (c), (s) or (m) in Table 1. 



3.2 Discussion of Results 

To answer our first question, we observe from Table 1 that the LMT algorithm 
indeed reaches accuracy levels at least comparable to those of both 
SimpleLogistic and C4.5: it is never significantly less accurate. It outperforms 
SimpleLogistic on six and C4.5 on 13 datasets, and both methods simultaneously 
on two datasets. SimpleLogistic performs surprisingly well on most datasets, 
especially the smaller ones. However, on some larger datasets (‘kr-vs-kp’, 
‘sick’, ‘hypothyroid’) its performance is a lot worse than that of any other 
method. Linear models are probably too restricted to achieve good performance 
on these datasets. 

² Weka is available from www.cs.waikato.ac.nz/~ml 

Table 2. Number of datasets where algorithm in column significantly outperforms 
algorithm in row. 

                 LMT   C4.5   SimpleLogistic   M5’   PLUS   AdaBoost.M1
LMT              -     0      0                0     0      6
C4.5             13    -      6                5     5      13
SimpleLogistic   6     3      -                4     4      10
M5’              8     0      1                -     1      12
PLUS             4     1      1                2     -      0
AdaBoost.M1      7     1      3                1     0      -

Table 3. Tree size. 

Data Set        LMT    C4.5      PLUS       Data Set        LMT    C4.5      PLUS
anneal          1.8    38.0 o    15.8 o     ionosphere      4.6    13.9 o    13.5 o
audiology       1.0    29.9 o    47.6 o     iris            1.1    4.6 o     6.1 o
australian      2.5    22.5 o    2.0        kr-vs-kp        8.0    29.3 o    42.6 o
autos           3.0    44.8 o    42.2 o     labor           1.0    4.2 o     5.1 o
balance-scale   5.3    41.6 o    1.9 •      lymphography    1.2    17.3 o    14.9 o
breast-cancer   1.1    9.8 o     9.1 o      pima-indians    1.0    22.2 o    1.2
breast-w        1.4    12.2 o    1.2        primary-tumor   1.0    43.8 o    26.7 o
german          1.0    90.2 o    4.3 o      segment         12.0   41.2 o    65.9 o
glass           7.0    23.6 o    25.2 o     sick            14.1   27.4 o    30.0 o
glass (G2)      4.6    12.5 o    15.5 o     sonar           2.7    14.5 o    13.2 o
heart-c         1.0    25.7 o    6.2 o      soybean         3.7    61.1 o    42.4 o
heart-h         1.0    6.3 o     5.1        vehicle         3.5    69.5 o    1.0 •
heart-statlog   1.0    17.8 o    1.0        vote            1.1    5.8 o     5.8 o
hepatitis       1.1    9.3 o     1.5        vowel           5.2    123.3 o   156.9 o
horse-colic     3.7    5.9       6.6        waveform-noise  1.0    296.5 o   1.0
hypothyroid     5.6    14.4 o    12.9 o     zoo             1.0    8.4 o     9.5 o

o, • statistically significant increase or decrease 

With regard to the second question, we can say that LMT achieves results 
similar to those of boosted C4.5 (although with strengths/weaknesses on different 
datasets). Comparing LMT with M5’, we find better results for LMT on almost 
all datasets. Note that LMT also outperforms PLUS, even though the selection 
of the best result from the three modes for PLUS introduces an optimistic bias. 

To answer our third question, Table 3 gives the observed average tree sizes 
(measured in number of leaves) for LMT, C4.5 and PLUS. It shows that the trees 
built by the LMT algorithm are always smaller than those built by C4.5 and 
mostly smaller than those generated by PLUS. For many datasets the average 
tree size for LMT is very close to one, which essentially means that the algorithm 
constructs a simple logistic model. To account for small random fluctuations, we 
will say the tree is pruned back to the root if the average tree size is less than 
1.5. This is the case for exactly half of the 32 datasets, and consequently the 
results for LMT on these datasets are almost identical to those of SimpleLogistic. 
It can be seen from Table 1 that on all datasets (with the exception of ‘vote’) 
where the tree is pruned back to the root, the result for LMT is better than 
that for C4.5, so it is reasonable to assume that for these datasets using a simple 
logistic regression model is indeed better than building a tree structure. Looking 
at the sixteen datasets where the logistic model tree is not pruned back to the 
root, we observe that on 13 of them LMT is more accurate than SimpleLogistic. 
This indicates a tree is only built if this leads to better performance than a 
single logistic model. From these two observations we can conclude that our 
method reliably makes the right choice between a simple logistic model and a 
more elaborate tree structure. 

We conclude that the LMT algorithm achieves better results than C4.5, 
SimpleLogistic, M5’ and PLUS, and results that are competitive with boosted C4.5. 
Considering that a single logistic model tree is easier to interpret than a boosted 
committee of C4.5 trees, we think that LMT is an interesting alternative to 
boosting trees. Of course, one could also boost logistic model trees, but because 
building them takes longer than building simple trees this would be 
computationally expensive. 



4 Related Work 

As mentioned above, model trees form the basis for the ideas presented here, 
but there has also been some interest in combining regression and tree induc- 
tion into ‘tree structured regression’ in the statistics community. For example, 
Chaudhuri et al. [3] investigate a general framework for combining tree induction 
with node-based regression models that are fit by maximum likelihood. Special 
cases include Poisson regression trees (for integer-valued class variables) and 
logistic regression trees (for binary class variables only). Chaudhuri et al. apply 
their logistic regression tree implementation to one real-world dataset, but it is 
not their focus to compare it to other state-of-the-art learning schemes. 

More recently, Lim presents an implementation of logistic regression trees 
called ‘PLUS’ [8]. There are some differences between the PLUS system and 
our method: first, PLUS does not consider nominal attributes when building 
the logistic regression models, i.e. it reverts to building a standard decision tree 
if the data does not contain numeric attributes. Second, PLUS uses a different 
method to construct the logistic regression models at the nodes. In PLUS, every 
logistic model is trained from scratch on the data at a node, whereas in our 
method the final logistic model consists of a committee of linear models trained 
on nested subsets of the data, thus naturally incorporating a form of ‘smoothing’. 
Furthermore, our approach automatically selects the best attributes to include 
in a logistic model, while PLUS always uses all or just one attribute (a choice 
that has to be made at the command line by the user). 

5 Conclusions 

This paper introduces a new method for learning logistic model trees that builds 
on earlier work on model trees. This method, called LMT, employs an efficient 
and flexible approach for building logistic models and uses the well-known CART 
algorithm for pruning. Our experiments show that it is often more accurate than 
C4.5 decision trees and standalone logistic regression on real-world datasets, and, 
more surprisingly, competitive with boosted C4.5 trees. Like other tree induction 
methods, it does not require any tuning of parameters. 

LMT produces a single tree containing binary splits on numeric attributes, 
multiway splits on nominal ones, and logistic regression models at the leaves, and 
the algorithm ensures that only relevant attributes are included in the latter. 
The result is not quite as easy to interpret as a standard decision tree, but much 
more intelligible than a committee of multiple trees or more opaque classifiers 
like kernel-based estimators. 
Abstract. In this paper, we apply kernel methods to the problem of color 
image segmentation, which has attracted increasing attention because color 
images provide more information than gray level images do. One natural 
approach to color image segmentation is to cluster pixels in color space. 
GMM has been applied to this task. However, practice has shown that GMM 
does not perform this task well in the original color space. Our basic idea 
is to solve the segmentation problem in a nonlinear feature space obtained 
by kernel methods. To this end, we propose an extension of the EM algorithm 
for GMM that includes a kernel feature extraction step, which we call K-EM. 
With a technique based on Monte Carlo sampling and mapping, K-EM not only 
speeds up the kernel step, but also automatically extracts good features for 
clustering in a nonlinear way. Experiments show that the proposed algorithm 
has satisfactory performance. The contribution of this paper can be 
summarized in two points: first, we introduce kernel methods to solve a real 
computer vision problem; second, we propose an efficient scheme for applying 
kernel methods to large scale problems. 



1 Introduction 

Image segmentation [1] is an essential and critical topic in image processing and 
computer vision. Recently, color image segmentation has attracted more and more 
attention, mainly for the following reasons: (1) Color images provide more 
information than gray level images do. (2) The power of computers is increasing 
rapidly, and PCs can deal with color images more easily. 

Color image segmentation [2] can be viewed as an extension of gray level 
image segmentation. The various methods can be roughly categorized into three 
typical approaches: (1) color space clustering based segmentation, (2) edge or 
contour detection based segmentation and (3) region or area extraction based 
segmentation. The basic idea of the clustering based approach [3] is to 
directly cluster the pixels in color space by employing a clustering algorithm 
such as k-means or Gaussian Mixture Models (GMM). The edge based approach is a 
more global method. The basic idea is to first extract the edges using an edge 
detector such as the Canny edge detector [4], then link the edges through edge 
linking and tracing methods, and finally obtain the segmentation by using the 
linked closed edges. The region based approach [5], including region growing 
and region split and merge, attempts to group pixels into homogeneous regions. 

In this paper, we address the problem of clustering based color image segmen- 
tation. In fact, image segmentation can be viewed as a hidden variable problem, 
i.e., segmenting an image into clusters involves determining which source cluster 
generates the image pixels. Based on this fact, a global mixture probabilistic 
model can be built up. For instance, GMM based on the basic Expectation- 
Maximization (EM) [6] algorithm is applied for image segmentation. 

One serious problem of GMM based segmentation is how to choose a proper 
color representation for segmentation. The frequently used color spaces, 
including linear spaces such as the RGB and YIQ color spaces and nonlinear 
spaces such as the HSV and LUV color spaces [2], have been found in practice to 
be unsuitable for clustering [3,7]. Two typical methods have therefore been 
proposed as improvements. One is that Data-Driven Markov Chain Monte Carlo 
(DD-MCMC) [8] is employed to improve the performance of the mixture 
probabilistic model for segmentation. The other is that Principal Components 
Analysis (PCA) is applied to find a characteristic color feature space from 
mono- or hybrid-color spaces for detecting clusters [2]. However, two critical 
problems arise. One is that PCA is a linear method, but color image 
segmentation is more likely to be nonlinear. The other is that most clustering 
based approaches only utilize the color information, but do not utilize the 
geometrical or spatial information. All these motivate us to use a kernel 
feature extraction method [9] instead of linear PCA to extract nonlinear 
features from both color and spatial information for clustering. 

The basic idea of the proposed method is to extend the EM algorithm by 
embedding one kernel feature extraction step (K-Step), which performs feature 
extraction and mapping to transform data from input space to a nonlinear 
feature space. (It should be emphasized that any kernel feature analysis method 
can be utilized in the K-Step; in this paper Kernel PCA is adopted.) The benefit 
of K-EM for GMM is that we not only avoid the extremely large computational 
cost of kernel feature analysis, but also extend GMM to be capable of dealing 
with data sets with complex structures. The experiments show that the proposed 
method has satisfactory performance. 

The rest of this paper is organized as follows. Section 2 presents the com- 
putational cost problem of kernel methods applied to large scale data sets, and 
proposes the K-EM algorithm for GMM to solve this problem. Section 3 demon- 
strates experimental results of the proposed algorithm on a synthetic data set 
and real world color image segmentation problems. Section 4 concludes. 

2 K-EM Algorithm for GMM 

As is known, the EM algorithm often fails on data sets with complex structures 
(nonlinear and non-Gaussian). One possible remedy is to find a nonlinear map 
so that the data is projected and well clustered in the mapped feature space. 
That leads to the intuitive idea of our K-EM algorithm. In this section, we first 
introduce the kernel feature analysis method and present the problem of its 
great computational cost on large scale data sets. Secondly, we propose the 
speedup technique for Kernel PCA based on Monte Carlo sampling and mapping. 
Thirdly, we provide the whole K-EM algorithm for GMM. Finally, we give some 
discussion of the proposed algorithm and related work. 



2.1 Kernel Feature Analysis 

The kernel trick [10] is an efficient method for nonlinear data analysis, first 
used in the Support Vector Machine (SVM). The idea is that we can implicitly map 
input data into a high dimensional feature space via a nonlinear function: 

$$\phi : X \to H \tag{1}$$

A similarity measure is then defined from the dot product in H as follows: 

$$k(x, x') = \langle \phi(x), \phi(x') \rangle \tag{2}$$

where the kernel function should satisfy Mercer’s condition [9,10]. 

Besides being successfully used in SVM, the kernel trick has been transferred to 
many other kinds of algorithms. Among these, Kernel Feature Analysis (KFA), 
including Kernel Principal Component Analysis (Kernel PCA) [11] and Kernel 
Fisher Discriminant (KFD) [9], is one of the most influential classes of methods. 

KFA aims to extract features in a nonlinear way. In this paper, we apply 
Kernel PCA as our feature extractor. We must stress that any other kernel 
feature extraction method could be employed in our final framework. Given a set 
of samples $\{x_i\}_{i=1}^{m}$ with zero mean, and using the nonlinear mapping 
and kernel trick defined in (1) and (2), Kernel PCA mainly depends on the 
eigen-decomposition problem on a Gram matrix: 

$$m \lambda \alpha = K \alpha \tag{3}$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^T$ is the vector of expansion 
coefficients, $\lambda$ is the eigenvalue and $K$ is the $m \times m$ Gram 
matrix with elements $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. 

To extract nonlinear features from a test point $x$, Kernel PCA computes the dot 
product between $\phi(x)$ and the $n$-th eigenvector $v^n$ in feature space to 
obtain the projection $y = (y^1, \ldots, y^p)$: 

$$y^n = \langle v^n, \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i^n \, k(x_i, x), \qquad n = 1, \ldots, p \tag{4}$$

where $p$ ($p < m$) is the dimension of the extracted feature space F, and $p$ 
is automatically chosen according to the leading eigenvalues $\lambda$. 
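To make Equations (3) and (4) concrete, here is a small pure-Python sketch that extracts a single kernel principal component. It is an illustration under several assumptions that are not from the paper: a Gaussian kernel with an arbitrary width, power iteration instead of a full eigen-decomposition, no explicit centering step (the zero-mean assumption of the text is taken for granted), and hypothetical function names throughout.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(xs, kernel):
    """K[i][j] = k(x_i, x_j), the m x m Gram matrix of Equation (3)."""
    return [[kernel(a, b) for b in xs] for a in xs]


def top_eigenpair(K, iters=1000, seed=0):
    """Power iteration: leading eigenvalue mu and unit eigenvector u of K."""
    rng = random.Random(seed)
    u = [rng.random() + 0.1 for _ in K]  # strictly positive start vector
    mu = 0.0
    for _ in range(iters):
        w = [sum(k_ij * u_j for k_ij, u_j in zip(row, u)) for row in K]
        mu = math.sqrt(sum(w_i * w_i for w_i in w))  # ||K u|| tends to mu
        u = [w_i / mu for w_i in w]
    return mu, u


def project(x, xs, alpha, kernel):
    """Equation (4) for one component: y = sum_i alpha_i * k(x_i, x)."""
    return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(alpha, xs))


xs = [(0.0,), (1.0,), (2.0,), (4.0,)]
K = gram_matrix(xs, rbf_kernel)
mu, u = top_eigenpair(K)  # mu plays the role of m * lambda in Equation (3)
# Scale the coefficients so the eigenvector in feature space has unit norm.
alpha = [u_i / math.sqrt(mu) for u_i in u]
features = [project(x, xs, alpha, rbf_kernel) for x in xs]
```

Every projection touches all m training points, and the Gram matrix itself has m × m entries, which is where the cost discussed next comes from.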

Kernel PCA is an efficient nonlinear method for feature extraction. The ad- 
vantage of Kernel PCA is that data with complex structure could be well clus- 
tered in feature space. However, the computational cost problem arises when 
Kernel PCA is applied to large scale data sets. 

As is known, through the kernel trick, the computational cost of Kernel PCA is 
mainly determined by the eigen-decomposition problem (3). Thus the computational 
cost directly depends on the size of the Gram matrix K, i.e. the size of the 
data set. As the size m increases, the method is liable to run into the curse 
of dimensionality. If m is large, e.g. larger than 5,000, it is currently 
impossible to finish the eigen-decomposition within hours even on the fastest 
PCs. Unfortunately, the size m is often very large, especially for data mining 
problems and pixel clustering based image segmentation (for example, m is 
16,384 for a not very large image of size 128x128). That is to say, the 
application of the Kernel PCA method is seriously restricted by the size of the 
data set. Any progress on this point is significant for the application of 
kernel methods. In the following subsection, we will focus on this point. 
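A quick back-of-the-envelope calculation illustrates the scale for the 128x128 example. The storage figure assumes a dense matrix of 8-byte floating point entries, which is an assumption of this sketch rather than something stated in the paper.

```python
def gram_matrix_cost(m, bytes_per_entry=8):
    """Number of entries and bytes needed to store a dense m x m Gram matrix."""
    entries = m * m
    return entries, entries * bytes_per_entry


m = 128 * 128                       # one sample per pixel
entries, n_bytes = gram_matrix_cost(m)
gib = n_bytes / 2 ** 30             # storage in GiB
```

For this image the Gram matrix alone has about 268 million entries (2 GiB in double precision), before any eigen-decomposition is attempted; dense eigen-solvers additionally scale roughly cubically in m.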

2.2 Speed up Kernel PCA by Monte Carlo Sampling and Mapping 

Some techniques have been proposed to handle the great computational cost 
problem in Kernel PCA. Most assume that the symmetric Gram matrix has a low 
rank. There are three typical techniques based on that assumption. The first 
technique is based on traditional Orthogonal Iteration or Lanczos Iteration. 
The second is to make the Gram matrix sparse by sampling techniques [12]. The 
third is to apply the Nyström method to speed up kernel machines [13]. However, 
all these methods still cannot efficiently solve very large problems such as 
pixel clustering based image segmentation. Moreover, none of these techniques 
has been successful in a practical problem. 

Here, we also adopt this basic assumption, but reconsider the problem in a 
different way. Consider that the samples forming the Gram matrix are drawn from 
a probability distribution $p(x)$; the eigen-problem can then be written in 
continuous form: 

$$\int k(x, y)\, p(x)\, V_i(x)\, dx = \lambda_i V_i(y) \tag{5}$$

where $\lambda_i$ and $V_i(y)$ are the eigenvalue and eigenvector corresponding 
to the Gram matrix, and $k(\cdot, \cdot)$ is a given kernel function. Note that 
the eigenvalues are ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots$. 

The integral on the left of Equation (5) can be approximated by using the 
Monte Carlo method to draw a subset of $N$ samples $x_1, \ldots, x_N$ according 
to $p(x)$: 

$$\int k(x, y)\, p(x)\, V_i(x)\, dx \approx \frac{1}{N} \sum_{j=1}^{N} k(x_j, y)\, V_i(x_j) \tag{6}$$

Then plugging in $y = x_k$ for $k = 1, \ldots, N$, we obtain a matrix 
eigen-problem: 

$$\frac{1}{N} \sum_{j=1}^{N} k(x_j, x_k)\, V_i(x_j) = \hat{\lambda}_i V_i(x_k) \tag{7}$$

where $\hat{\lambda}_i$ is the approximation of the eigenvalue $\lambda_i$. 
Fortunately, this approximation has been proved feasible and has a bounded 
error [14]. This result indicates that we 
can approximate the eigenvalues of a large Gram matrix by using only a subset 
of the samples. 
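
The subset approximation can be tried numerically. The sketch below is only an illustration under stated assumptions: the data are synthetic 2-D Gaussian points, the kernel is an RBF kernel with an arbitrary width, and the leading eigenvalue of the scaled Gram matrix (1/n)K is estimated by plain power iteration. None of these choices comes from the paper; `rbf` and `top_eigenvalue` are hypothetical helper names.

```python
import math
import random

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def top_eigenvalue(points, iters=100):
    """Estimate the largest eigenvalue of (1/n) * K by power iteration."""
    n = len(points)
    K = [[rbf(p, q) / n for q in points] for p in points]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient v^T (K/n) v for the unit vector v
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
subset = random.sample(data, 40)      # uniform Monte Carlo subset

lam_full = top_eigenvalue(data)       # uses all m = 200 samples
lam_sub = top_eigenvalue(subset)      # uses only N = 40 samples
print(lam_full, lam_sub)
```

The two printed values can be compared directly; the cost of the second call depends on N = 40 rather than on m = 200, which is the point of the approximation.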

Our proposed technique is based on this result. Considering data $x_i \in X$ with 
dimension d, Kernel PCA projects $x_i$ into the feature space F as $y_i$ with 
dimension p, where $p \ll m$ holds under the basic low rank assumption. Since 
eigenvalues are associated with eigenvectors, the eigenvectors corresponding to 
the approximated eigenvalues form a representative space D, which can be 
viewed as an approximation of the feature space F. Our work uses this 
property to construct a representative space D that approximates the feature space F in an 
iterative procedure. 

Suppose each sample $x_i$ is drawn from a known distribution $p(x)$; in other words, 
each sample $x_i \in X$ is associated with a known sampling weight $w(x_i)$. In most 
cases, and especially in our color segmentation problem, $p(x)$ is not known. However, 
it is possible to use all samples to estimate a distribution $\hat{p}(x)$ that approximates 
$p(x)$. We sample the data set X according to $\hat{p}(x)$ independently T times to 
obtain T subsets $S_i$ ($1 \le i \le T$), where each $S_i$ is a subset of N elements (under 
the low rank assumption, $N \ll m$, $N \ge p$) drawn from X according 
to $\hat{p}(x)$. The sampling procedure can thus be viewed as Sampling Importance 
Resampling (SIR) [15], which is one kind of Monte Carlo sampling method. 
Afterward, we perform Kernel PCA on each $S_i$ and obtain a representative space $D_i$. 
We apply Support Vector Regression (SVR) [10] to construct the mapping from 
$S_i$ to $D_i$. 
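
The resampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `draw_subsets` is a hypothetical helper, sampling is done with replacement (an assumption), and the Kernel PCA and SVR stages are deliberately left out.

```python
import random

def draw_subsets(samples, weights, T, N, rng=random):
    """Draw T subsets of N elements each, with replacement,
    picking sample i with probability proportional to weights[i]."""
    return [rng.choices(samples, weights=weights, k=N) for _ in range(T)]

random.seed(0)
X = [(random.random(), random.random()) for _ in range(1000)]
w = [1.0 / len(X)] * len(X)          # uniform sampling weights
subsets = draw_subsets(X, w, T=10, N=40)
```

With uniform weights this reduces to plain random subsampling; non-uniform weights concentrate the subsets on the samples judged most informative.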



The T maps are then combined into a single nonlinear map from X to D as 
follows. 



When a new sample x arrives, it can be projected into the representative 
space D by (9). We emphasize that SVR is used to learn the mapping 
because of its generalization capability, and that T subsets are used because 
an ensemble of T subsets yields a result with lower variance and greater robustness. 

The whole sampling, resampling and combining procedure can be viewed 
as a development of the bagging ensemble technique [16]. 

However, there is still a sticking point in the speedup scheme: how to 
obtain the distribution $p(x)$, or how to provide the sampling weight information. We 
cannot provide this information in one step, so we adopt the idea of the Sequential 
Monte Carlo method and provide it in an iterative procedure. 
We first project all the samples into the representative space, then 
estimate a distribution in that space and update the sampling weights according 
to the estimated distribution, and then draw samples according to the new 
sampling weights. Running this procedure iteratively until convergence 
yields the expected result. This is achieved by efficiently combining 
Kernel PCA with GMM in the next subsection. 



$$y = \ldots, \quad 1 \le i \le T \qquad (8)$$

$$\ldots \qquad (9)$$
2.3 K-EM Algorithm for GMM 



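GMM is a mixture density model which assumes that each component 
of the probabilistic model is a Gaussian density. That is to say: 

$$p(y \mid \Theta) = \sum_{i=1}^{M} \alpha_i\, G_i(y \mid \theta_i) \qquad (10)$$

where the parameters $\Theta = (\alpha_1, \ldots, \alpha_M; \theta_1, \ldots, \theta_M)$ satisfy $\sum_{i=1}^{M} \alpha_i = 1$, $\alpha_i \ge 0$, and 
$G_i(y \mid \theta_i)$ is a Gaussian probability density function with parameter $\theta_i = (\mu_i, \Sigma_i)$. 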
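
As a small illustration of the mixture density, the sketch below evaluates a one-dimensional Gaussian mixture. Restricting to one dimension (so that each covariance is a scalar variance) is a simplification made here, and the function names are hypothetical.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(y, alphas, means, variances):
    """Mixture density: sum_i alpha_i * G_i(y | mean_i, var_i)."""
    # Mixing weights must be non-negative and sum to one.
    assert all(a >= 0 for a in alphas) and abs(sum(alphas) - 1.0) < 1e-9
    return sum(a * gaussian_pdf(y, mu, v)
               for a, mu, v in zip(alphas, means, variances))

# Two identical standard-normal components reduce to a single standard normal.
value = gmm_pdf(0.0, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```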

The EM algorithm [17] has been successfully used in generative models such as 
GMM. However, traditional EM for GMM often fails when the structure of the data 
does not conform to the Gaussian mixture assumption, especially for image pixels. 
EM has therefore been extended in many ways to overcome this problem. 

D-EM [18] is one of the extensions of EM. It focuses mainly on the two-category 
semi-supervised learning problem, in which 
a minority of the samples is labeled and the majority is unlabeled. D-EM solves the 
problem by embedding a linear transformation step, which finds a good fit of the data 
distributions and automatically selects good features. However, the linear 
transformation and the two-category learning framework limit its applications. 
The kernel trick has also been introduced into D-EM [19], but the existing kernel version of 
D-EM did not start from the idea of speeding up the kernel feature extraction method 
in a probabilistic framework. 

The proposed K-EM for GMM aims to extend D-EM in at least three respects. 
The linear transformation is replaced by Kernel PCA (in fact, any kernel 
feature analysis can be utilized), the two-category semi-supervised learning 
problem is extended to a multi-component clustering problem, and GMM 
provides sampling weight information so that Kernel PCA is sped up greatly 
by the Monte Carlo sampling and mapping technique. 

The basic idea of K-EM for GMM is to combine efficiently the two 
operations of Kernel PCA and parameter estimation of GMM. Kernel PCA extracts 
good features for GMM, whereas GMM provides the sampling weight information 
needed by the proposed speedup technique for Kernel PCA. That is to say, in 
iteration step (t - 1), we estimate a distribution in the representative 
space D using GMM. Then we update the sampling weight of each sample $x_i$ by the 
following equation, since each $x_i$ corresponds to one $y_i$: 

$$w^{(t)}(x_i) = \ldots \qquad (11)$$

Consequently, we perform the sampling importance resampling and mapping step 
according to the updated weight information and the scheme described in the previous 
subsection. 

To summarize, for a given data set X, the proposed K-EM algorithm for 
GMM iterates over three steps, namely "Expectation-Kernel feature extraction- 
Maximization". The detailed K-EM algorithm for GMM is shown in Table 1. 
The algorithm initializes all the samples with the same sampling weight 1/m; in 
other words, the first E-Step performs uniform sampling. The algorithm 
terminates only when the parameters converge or the preset maximum number of 
iteration steps is reached. 

Table 1. The detailed K-EM algorithm for GMM. 

Input: Data set X 

Output: Clustering parameters $\Theta$ of GMM 

S1: Initialize all samples with the same sampling weight 1/m, the number of clusters 
C, the maximum number of iteration steps L, the iteration counter t = 0, the number of subsets T 
and the size of subsets N. 

S2: Iteration counter t = t + 1 

- E-Step samples the set X according to the sampling weight information provided by 
the M-Step to obtain T subsets $S_i$. 

- K-Step performs Kernel PCA on each subset $S_i$, projects the data in $S_i$ into 
the representative space $D_i$, then learns the mapping function (8) between 
the input space and the representative space of $S_i$. The T learned maps 
are combined to obtain the final map by (9); all the samples in X are then 
projected into the representative space D by (9). 

- M-Step estimates the parameters $\Theta$ of GMM and updates the sampling 
weight of each sample by (11). 

S3: Test whether convergence is reached or t > L. If not, loop back to S2; 
otherwise return the parameters $\Theta$ and exit. 

2.4 Discussion of the Algorithm 

As mentioned in the previous section, the problem we address is the application of 
eigen-decomposition based kernel methods to large scale data sets. If the data 
set size m is large enough, e.g. larger than 5,000, the eigen-decomposition is 
intractable. However, the time complexity of the proposed method depends on 
the subset size N instead of the whole data set size m, and N can be much smaller 
than m ($N \ll m$); the proposed method therefore solves the problem indirectly but 
much more efficiently. This is our motivation. 

There is other related work besides D-EM and its kernel version. 
One of the most influential methods is the particle filter, or bootstrap filter [20]. 
The particle filter also iteratively uses weight updating and importance sampling, 
which is called Sequential Importance Sampling (SIS). In our algorithm, 
the samples in a subset can be viewed as the particles of a particle filter, and their 
number is also fixed. But there are two obvious differences. First, the particle 
filter only uses particles to approximate the data distribution, whereas our method 
projects all data into the feature space and approximates the data distribution by 
GMM. Second, the particle filter operates only in the input data space, whereas our 
method operates in a feature space obtained by Kernel PCA. 

The other one is spectral clustering [21]. Spectral clustering can be regarded 
as first using RBF (and only RBF) based kernel methods to extract features, 
and then performing clustering by k-means. It is not an iterative procedure. 
Moreover, it still cannot deal with problems involving large scale data sets. 

Our proposed method combines the idea of the particle filter with that of spectral 
clustering, and can be regarded as a kernel extension of the particle filter and a 
sequential version of spectral clustering in some sense. 



3 The Experimental Results 

To provide an intuitive illustration of the proposed algorithm, we first 
demonstrate K-EM for GMM on a synthetic 2-D data set, and then on a real-world 
color image segmentation problem. 

3.1 Synthetic 2-D Data Clustering 

The synthetic data set with 2,000 samples is depicted in Fig. 1. Traditional GMM 
based on basic EM, using the coordinates of the sample points as features, partitions 
the data set into two clusters as shown in Fig. 1(b). The result is obviously not 
satisfactory. As a comparison, the proposed K-EM for GMM is also applied to this 
task using a polynomial kernel 

$$k(x, x') = (x \cdot x')^d \qquad (12)$$

with degree d = 2, setting the representative dimension p = 4 and the subset $S_i$ 
size N = 15. In order to observe the sampling procedure easily, we 
set the number of subsets T = 1. With these parameters, we achieve the promising 
results shown in Fig. 1(a). Fig. 1(c) shows the 15 samples drawn at the beginning 
of the algorithm, and Fig. 1(d) shows the 15 samples drawn at the end of the 
algorithm. The samples in Fig. 1(d) clearly provide more information about 
the structure of the data set than those of Fig. 1(c). 

We also want to emphasize the running time compared with 
spectral clustering. On a Pentium 4 2.0 GHz PC, the proposed algorithm 
finishes the whole clustering problem within 10 seconds, whereas spectral clustering 
needs half an hour to achieve the same result using the RBF kernel function 

$$k(x, x') = \exp(-\gamma \|x - x'\|^2) \qquad (13)$$

with $\gamma = 10$. The proposed algorithm thus achieves a satisfactory 
result while reducing the computational cost noticeably, which demonstrates 
the power of the proposed algorithm. 
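
For reference, the two kernels used in the experiments, the polynomial kernel (12) and the RBF kernel (13), can be written directly. This is a plain-Python sketch with hypothetical function names; the default parameter values are the ones quoted in this subsection.

```python
import math

def poly_kernel(x, y, d=2):
    """Polynomial kernel k(x, x') = (x . x')^d, Equation (12)."""
    return sum(a * b for a, b in zip(x, y)) ** d

def rbf_kernel(x, y, gamma=10.0):
    """RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2), Equation (13)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

print(poly_kernel((1, 2), (3, 4)))    # 121
print(rbf_kernel((1, 2), (1, 2)))     # 1.0
```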

3.2 Color Image Segmentation 

In this part, we present some experiments on color image segmentation. 

As Kernel PCA projects the original color space into a nonlinear space, the 
original color space we adopt is a linear one, namely the RGB color space. As 
mentioned in the first section, spatial information is also utilized for the 
clustering. The input features for our algorithm are (r, g, b, x, y), where (x, y) 
is the coordinate and (r, g, b) the RGB value of the corresponding pixel. We choose 
Kernel PCA as the kernel feature extraction method, and the RBF kernel of 
Equation (13) is used in all segmentation experiments. Some parameters can be 
kept almost fixed for most segmentation problems: the number of subsets T = 10, the 
dimension of the representative space p = 7, and the number of samples in each subset N = 400. 
The number of clusters C and the RBF kernel parameter $\gamma$ need to be determined 
in practice. 

Fig. 1. Intuitive illustration of K-EM for GMM. (a) Clustering result by the proposed 
method. (b) Clustering result by GMM based on basic EM. (c) 15 samples drawn at 
the beginning of the K-EM algorithm (marked by 'x'). (d) 15 samples drawn at the 
end of the K-EM algorithm. 

As a comparison, GMM based on basic EM is applied to the same task. 
It adopts the same number of clusters as the proposed method, but only uses the 
RGB color space as the feature space for clustering. 

Results are shown in Fig. 2. The first column depicts the original images, the second 
column depicts the corresponding results of our K-EM for GMM, and the third 
column depicts the results of GMM based on basic EM. Both examples set the 
number of clusters C to three (C = 3). The RBF kernel parameter is set to 
$\gamma = 0.15$ and $\gamma = 1$ for Fig. 2(b) and Fig. 2(e) respectively. 
The results achieved by our method are satisfactory and much better than those of GMM 
based on basic EM. 




Fig. 2. Color image segmentation results achieved by the proposed algorithm and basic 
GMM. Original images are depicted in the first column, the corresponding results of 
our K-EM for GMM in the second column, and the results of GMM based on 
basic EM in the third column. Both examples set the number of clusters C 
to three (C = 3). 



4 Conclusion 

In this paper, we extend the EM algorithm to K-EM by adding a kernel feature 
extraction step (in this paper, Kernel PCA). The proposed K-EM for GMM 
efficiently combines Kernel PCA and GMM: GMM provides the sampling 
weight information needed by the speedup technique for Kernel PCA, whereas 
Kernel PCA extracts good features for GMM. The advantage of the combination 
is that it avoids the high computational cost incurred 
when directly performing the eigen-decomposition of the Gram matrix in problems 
with large scale data sets. This was the original intention of our proposed 
algorithm. 

Experiments have been carried out on a synthetic data set and on a real-world color image 
segmentation problem. Results show that the algorithm has a 
satisfactory performance and does much better than GMM based on basic EM. 
The attempt at color image segmentation is significant since it presents a way 
to solve a real computer vision problem efficiently by kernel methods. 

Since the proposed K-EM algorithm requires the number of clusters 
C to be preset, our future work will focus on automatically selecting C with methods such 
as Reversible Jump Markov Chain Monte Carlo as in [22]. We also intend to 
integrate multi-scale information to improve the color image segmentation results. 
Abstract. We consider the topographic clustering task and focus on the problem 
of its evaluation, which makes model selection possible: topographic 
clustering algorithms, from the original Self Organizing Map to its kernel-based 
extension (STMK), can be viewed in the unified framework of constrained 
clustering. Exploiting this point of view, we discuss existing quality measures and 
propose a new criterion based on an F-measure, which combines a compactness 
criterion with an organization criterion, and we extend it to their kernel-based versions. 



1 Introduction 

Since their definition by Kohonen [1], Self Organizing Maps have been applied in var- 
ious domains (see [2]), such as speech recognition, image analysis, robotics or organi- 
zation of large databases. They solve a topographic clustering task, i.e. pursue a dou- 
ble objective: as any clustering method, they aim at determining significant subgroups 
within the whole dataset; simultaneously they aim at providing information about the 
data topology through an organized representation of the extracted clusters, such that 
their relative distance reflects the dissimilarity of the data they contain. 

We consider the problem of evaluating the results, which is an important step in a 
learning process. Defining a quality measure of the obtained model makes it possible to 
compare models and thus to select among them. Many criteria have been proposed, but most 
of them fail to take into account the double objective of clustering and organization. 

To address the evaluation question, we show that the various topographic clustering 
approaches can be viewed as solving a constrained clustering problem, the difference 
lying in the expression of the constraint which conveys the organization demand. We 
propose a new criterion which estimates the map’s quality by combining, through an 
F-measure [3], an evaluation of its clustering quality with a measure of its organiza- 
tion quality; we apply it to classic and kernel-based maps to perform hyperparameter 
selection and data encoding comparison. 

2 Constrained Clustering 

We divide the various formalizations proposed for topographic clustering, 
summarized in Table 1, into four categories and highlight the way they express the 
constraint. 

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 265-276, 2003. 
2.1 Neural Networks 



A first category of topographic algorithms is based on a neural network representation. 
Each neuron is associated with a position $z_r$ and a weight vector $w_r$ which represents 
a cluster center and has the same dimension as the input data. A topology is defined 
on this set of neurons through a K x K neighborhood matrix, where K is the number of 
nodes. It is defined as a decreasing function of the distance between positions, e.g. 

$$h_{rs} = \exp\left(-\frac{\|z_r - z_s\|^2}{2\sigma_h^2}\right) \qquad (1)$$



The Self Organizing Maps (SOM) algorithm introduced by Kohonen [1] is defined 
by the following iterative learning rule: 

$$w_r(t+1) = w_r(t) + \alpha_t\, h_{r g(x_t)}(t)\, (x_t - w_r(t)) \quad \text{with} \quad g(x_t) = \arg\min_s \|x_t - w_s\|^2 , \qquad (2)$$

where $x_t$ is a data point; the neighborhood term $h_{rs}$ and the learning rate $\alpha_t$ decrease 
during the learning procedure; $g(x_t)$ denotes the winning node, i.e. the neuron nearest to $x_t$ 
in terms of weight. At each step, the similarity between a data point and its winning 
node is increased; as a result, similar data are assigned to the same node, whose weight 
vector corresponds to an average representative of its associated data. Moreover, through 
the coefficient $h_{r g(x_t)}$, a data point modifies the weights of its node's neighbors: the 
organization constraint is expressed as an influence sphere around the winning node, 
which affects its neighbors. The parameter $\sigma_h(t)$ monitors the width of the influence 
areas and thus the distance at which the organization constraint is still felt. 
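
One step of the learning rule (2) can be sketched as follows. This is an illustrative sketch only: `som_step` is a hypothetical name, the neighborhood is assumed to be a Gaussian function of grid distance with width `sigma_h`, and the schedules that decrease `alpha` and `sigma_h` over time are left to the caller.

```python
import math

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def som_step(weights, positions, x, alpha, sigma_h):
    """Apply one SOM update for data point x; return the winner index."""
    # Winning node: neuron whose weight vector is nearest to x.
    winner = min(range(len(weights)), key=lambda s: sqdist(x, weights[s]))
    for r in range(len(weights)):
        # Assumed Gaussian neighborhood between node r and the winner.
        h = math.exp(-sqdist(positions[r], positions[winner]) / (2 * sigma_h ** 2))
        weights[r] = [w + alpha * h * (xi - w) for w, xi in zip(weights[r], x)]
    return winner

positions = [(0,), (1,), (2,)]                  # 1-D grid of three nodes
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
g = som_step(weights, positions, x=[0.9, 1.2], alpha=0.5, sigma_h=1.0)
```

In this example the middle node wins and moves halfway toward the data point, while its two neighbors move by a smaller amount scaled by the neighborhood term.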

Heskes [4] showed that the learning rule (2) cannot be derived from an energy 
function when the data follow a continuous distribution, and therefore lacks 
theoretical properties (see also [5]). Thus, he proposes to train the network by optimizing an 
energy function that leads to a slightly different map, which still fulfills the topographic 
clustering aims. In the case of a finite dataset $X = \{x_i, i = 1..N\}$, it is defined as 

$$E = \frac{1}{2N} \sum_{i=1}^{N} \sum_{r=1}^{K} h_{r g(x_i)} \|x_i - w_r\|^2 \quad \text{with} \quad g(x_i) = \arg\min_s \sum_{t=1}^{K} h_{st} \|x_i - w_t\|^2 . \qquad (3)$$

Thus, the winning node is not simply the nearest neighbor of $x_i$ as in SOM, but takes into 
account the resemblance to neighbor nodes. If h is defined as in (1), $\forall r, h_{rr} = 1$, we 
propose to write $E = E_1 + E_2$ with 

$$E_1 = \frac{1}{2N} \sum_{i=1}^{N} \|x_i - w_{g(x_i)}\|^2 \quad \text{and} \quad E_2 = \frac{1}{2N} \sum_{i=1}^{N} \sum_{r \ne g(x_i)} h_{r g(x_i)} \|x_i - w_r\|^2 , \qquad (4)$$

and thus to interpret E in a constrained clustering context: $E_1$ equals the cost function 
optimized by the k-means algorithm. $E_2$ imposes the organization: when it is minimum, 
neighbor cells, corresponding to high $h_{rs}$ values, have similar weight vectors. 
Indeed, $\|w_{g(x_i)} - w_r\|^2 \le \|w_{g(x_i)} - x_i\|^2 + \|x_i - w_r\|^2$, where the first term is low because 
of $E_1$ minimization and the second because of the term $h_{r g(x_i)} \|x_i - w_r\|^2$ in $E_2$. 
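
The decomposition of the energy into a k-means term and an organization term can be checked numerically. The sketch below is illustrative: `energy_terms` is a hypothetical helper, the 1/(2N) prefactor and the Gaussian neighborhood matrix are assumptions made for the example, and the identity E = E1 + E2 relies on the diagonal of the neighborhood matrix being 1.

```python
import math

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def energy_terms(X, W, H):
    """Return (E, E1, E2) for data X, centers W and neighborhood matrix H."""
    N, K = len(X), len(W)
    E = E1 = E2 = 0.0
    for x in X:
        d = [sqdist(x, w) for w in W]
        # Winner minimizes the neighborhood-weighted distortion.
        g = min(range(K), key=lambda s: sum(H[s][t] * d[t] for t in range(K)))
        E += sum(H[r][g] * d[r] for r in range(K))
        E1 += d[g]                                   # clustering (k-means) part
        E2 += sum(H[r][g] * d[r] for r in range(K) if r != g)  # organization part
    return E / (2 * N), E1 / (2 * N), E2 / (2 * N)

Z = [0, 1, 2]                                        # grid positions
H = [[math.exp(-(a - b) ** 2 / 2.0) for b in Z] for a in Z]
X = [(0.1,), (0.9,), (2.2,), (1.8,)]
W = [(0.0,), (1.0,), (2.0,)]
E, E1, E2 = energy_terms(X, W, H)
```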
The parameter $\sigma_h$ can be interpreted in a regularization framework: it monitors the 
relative importance of the main goal, clustering, and the imposed constraint, and thus 
the number of free parameters. When it is low, most $h_{rs}$ terms are zero and the dominant 
term in E is $E_1$; when $E_2$ becomes prevalent (for high $\sigma_h$ values), the map falls in 
a degenerate state where only the furthest nodes on the grid are non empty. Indeed, 
for such neurons, the weight is exclusively determined by the constraint and not by a 
tradeoff taking into account its assigned data, which makes it possible to maximize organization. 

Graepel, Burger and Obermayer [6] propose a deterministic annealing scheme to 
optimize this cost function, which leads to global and stable minima; besides the weights 
$w_r$, it provides assignment probabilities $p(x_i \in C_r) \in [0, 1]$, where $C_r = \{x_i \mid g(x_i) = r\}$. 
The associated algorithm is called Soft Topographic Vector Quantization (STVQ). 

2.2 Markov Chains 

Luttrell [7] considers topographic clustering as a noisy coding-transmission-decoding 
process, which he models by a specific Markov chain, called Folded Markov Chain 
(FMC): it consists of a chain of probabilistic transformations followed by the chain of 
the inverse (in the Bayes sense) transformations. 

He shows that SOM are a specific case of a two-level FMC. The first level 
corresponds to the coding step, which is equivalent to a clustering phase: it assigns data 
according to the rule defined in equation (3) and codes them by the corresponding vector. 
The second level represents the transition probability to other clusters; it is fixed a 
priori and makes it possible to express the constraint in the same way as a normalized neighborhood 
matrix (see [6]). The optimized cost function is defined as the reconstruction cost; it is 
equivalent to the function (3) if the neighborhood matrix is normalized. 

2.3 Probability Distribution Modelling 

Other formalizations aim at explicitly modelling the data probability distribution. 

Utsugi [8] considers a bayesian framework to learn a gaussian mixture, constrained 
through a smoothing prior on the set $W = \{w_r, 1 \le r \le K\}$. The prior is based on a 
discretized differential operator D: 

$$p(W \mid \alpha) = \prod_{j=1}^{d} C \exp\left(-\frac{\alpha}{2} \|D w_{(j)}\|^2\right) \quad \text{with} \quad C = \left(\frac{\alpha}{2\pi}\right)^{l/2} \left(\det{}_{+} D^T D\right)^{1/2} , \qquad (5)$$

where $w_{(j)}$ is the vector of the jth components of the centers, $l = \mathrm{rank}(D^T D)$, and 
$\det_{+} D^T D$ denotes the product of the positive eigenvalues of $D^T D$. Thus, a set of weights 
is a priori all the more probable as its components have a low amplitude evolution, as 
expressed by the differential operator D. The centers $w_r$ are learnt by maximizing the 
penalized data likelihood, computed as a gaussian mixture with this prior on centers; $\alpha$ 
monitors the importance of the constraint. 

Bishop, Svensen and Williams [5] also consider a gaussian mixture, based on a 
latent variable representation: a data point $x \in \mathbb{R}^d$ is generated by a latent variable $z \in \mathcal{L}$ of 
lower dimension $l < d$, through a function $\psi$ with parameters A: $x = \psi(z; A)$. Denoting 
by $\beta$ the inverse variance of a gaussian noise process, and defining $p(z)$ as the sum of delta functions 
centered at the nodes of a grid in $\mathcal{L}$, $p(z) = \frac{1}{K} \sum_{r=1}^{K} \delta(z - z_r)$, $p(x)$ is defined as 

$$p(x \mid A, \beta) = \frac{1}{K} \sum_{r=1}^{K} \left(\frac{\beta}{2\pi}\right)^{d/2} \exp\left(-\frac{\beta}{2} \|\psi(z_r; A) - x\|^2\right) , \qquad (6)$$

which corresponds to a constrained gaussian mixture: the centers $\psi(z_r; A)$ cannot 
evolve independently, as they are linked through the function $\psi$, whose parameters A 
are to be learnt. The continuity of $\psi(\cdot; A)$ imposes the organization constraint: two 
neighbor points $z_A$ and $z_B$ are associated with two neighbor images $\psi(z_A; A)$ and 
$\psi(z_B; A)$. 

Heskes [9] shows that the energy function (3) can be interpreted as a regularized 
gaussian mixture: in a probabilistic context, it can be written as the data likelihood plus a 
penalization term, defined as the deviation of the learnt center $w_r$ from the value imposed 
by organization, $\tilde{w}_r = \sum_s h_{rs} w_s$. The solution must find a tradeoff between adapting 
to the data and keeping the deviation low; it thus solves a constrained clustering task. 

2.4 Kernel Topographic Clustering 

Graepel and Obermayer [10] propose an extension of topographic clustering, called 
Soft Topographic Mapping with Kernels (STMK), using the kernel trick: it is based on 
a non-linear transformation $\phi : \mathbb{R}^d \to \mathcal{F}$ to a high, or infinite, dimensional space, called 
the feature space; it should highlight relevant correlations which may remain 
unnoticed in the input space. STMK transposes the cost function (3) to $\mathcal{F}$ by applying 
STVQ to $\phi(x_i)$; the centers, denoted $w_r^{\phi}$, then belong to $\mathcal{F}$. The cost function becomes 

$$E^{\phi} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{r=1}^{K} h_{r g(x_i)} \|\phi(x_i) - w_r^{\phi}\|^2 \quad \text{with} \quad g(x_i) = \arg\min_s \sum_{t=1}^{K} h_{st} \|\phi(x_i) - w_t^{\phi}\|^2 .$$

Provided $w_r^{\phi}$ is searched as a linear combination of the $\phi(x_i)$, as $w_r^{\phi} = \sum_i a_{ir} \phi(x_i)$, the 
computations are expressed solely in terms of dot products $\langle \phi(x_i), \phi(x_j) \rangle$ [10]. 
Thus, defining a kernel function k such that $\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)$, it is possible 
to optimize $E^{\phi}$ without doing costly calculations in the high dimensional space $\mathcal{F}$. This 
algorithm being a direct transposition of STVQ to $\phi(x_i)$, it has the same interpretation 
in terms of constrained clustering, in the feature space. 

3 Topographic Clustering Evaluation 

Table 1 summarizes the previous algorithms. Whichever choice is made, the resulting map 
must be evaluated, to determine its validity and possibly to perform a posteriori model 
selection: in topographic clustering, this implies choosing the appropriate neighborhood 
parameter and the adequate size of the grid*, plus the kernel parameter in the kernelized 
approach. According to the previous constrained clustering framework, maps must be 
assessed along two lines: their clustering capacity and their respect of the constraint, 
i.e. their organization. Yet most existing measures only take into account one aspect; 
using the notations of section 2.1, we discuss some of the existing criteria. 

* assuming the dimension of the visualization space is 2. 

Table 1. Summary of some characteristics of topographic clustering algorithms (see section 2). 

Designation    Principle                     Learning algorithm       Probabilistic modelling  Constraint expression          Associated references
SOM            neural net                    iterative rule           no                       influence areas                [2]
STVQ, STMK     neural net                    deterministic annealing  possible                 influence area                 [6, 4, 9], [10]
FMC            probabilistic transformation  EM                       yes                      probabilistic influence        [7]
Utsugi         gaussian + prior              EM                       yes                      smooth weight differential     [8]
Bishop et al.  latent variable               EM                       yes                      continuous generation process  [5]

3.1 Clustering Quality 

Kohonen [2] proposes to use the classic criterion called quantization error, which is the 
cost function of the k-means algorithm and is defined as the cost of representing a data point 
x by the center of the cluster it is assigned to: 

$$qC_1 = \frac{1}{N} \sum_{i=1}^{N} \sum_{r=1,\, x_i \in C_r}^{K} \|x_i - w_r\|^2 . \qquad (7)$$

For topographic clustering, contrary to clustering, the center of a cluster and its mean are 
distinct, as centers are influenced by their neighbors due to the organization constraint. 
Thus, computing the distance to centers introduces a bias in the homogeneity measure, 
and under-estimates the clustering quality. We propose to measure the cost obtained 
when representing a data point by the mean $\bar{x}_r$ of its cluster: 

$$qM_1 = \frac{1}{N} \sum_{i=1}^{N} \sum_{r=1,\, x_i \in C_r}^{K} \|x_i - \bar{x}_r\|^2 \quad \text{with} \quad \bar{x}_r = \frac{1}{|C_r|} \sum_{i,\, x_i \in C_r} x_i . \qquad (8)$$

Only the identified subgroups intervene in the measure, which makes it a justified 
clustering quality criterion. 

Compactness can also be measured by the average variance of clusters [11]: 

$$qM_2 = \frac{1}{K^*} \sum_{r=1}^{K} \frac{1}{|C_r|} \sum_{i,\, x_i \in C_r} \|x_i - \bar{x}_r\|^2 , \quad K^* = \text{number of non empty clusters.} \qquad (9)$$

One can notice that $qM_1$ is also a weighted average variance, whose weighting 
coefficients are $|C_r| K^* / N$, i.e. the quotient between the cluster cardinal and an average 
cluster cardinal, under an equi-distribution assumption. 
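
The three hard-assignment criteria can be computed side by side. The sketch below assumes that every data point carries the index of its cluster in `labels`, that `W` holds the map centers, and that empty clusters are skipped when averaging variances; the function name and the toy data are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def clustering_quality(X, labels, W):
    """Return (qC1, qM1, qM2) for hard assignments `labels` and centers W."""
    N = len(X)
    clusters = {}
    for x, g in zip(X, labels):
        clusters.setdefault(g, []).append(x)          # only non empty clusters
    means = {r: [sum(c) / len(pts) for c in zip(*pts)]
             for r, pts in clusters.items()}
    qC1 = sum(sqdist(x, W[g]) for x, g in zip(X, labels)) / N
    qM1 = sum(sqdist(x, means[g]) for x, g in zip(X, labels)) / N
    qM2 = sum(sum(sqdist(x, means[r]) for x in pts) / len(pts)
              for r, pts in clusters.items()) / len(clusters)
    return qC1, qM1, qM2

X = [(0.0,), (2.0,), (10.0,)]
labels = [0, 0, 1]
W = [(0.0,), (10.0,), (20.0,)]      # third node is empty
qC1, qM1, qM2 = clustering_quality(X, labels, W)
```

In this toy case the first center sits on a data point rather than on the cluster mean, so the distance to centers (qC1 = 4/3) is larger than the distance to means (qM1 = 2/3), which illustrates the bias discussed above.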

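Some learning algorithms, like STVQ, provide assignment probabilities $p(x_i \in C_r)$, 
which are normalized so that $\forall i, \sum_r p(x_i \in C_r) = 1$ and equal the conditional 
probabilities $p(C_r \mid x_i)$. They lead to a probabilistic quantization error $qM_1^p$, computed by 
averaging the individual probabilistic errors, and a probabilistic variance mean $qM_2^p$: 

$$qM_1^p = \frac{1}{N} \sum_{i=1}^{N} e_i \quad \text{with} \quad e_i = \sum_{r=1}^{K} p(C_r \mid x_i) \|x_i - \bar{x}_r\|^2 , \qquad (10)$$

$$qM_2^p = \frac{1}{K} \sum_{r=1}^{K} \sigma^2(C_r) \quad \text{with} \quad \sigma^2(C_r) = \sum_{i=1}^{N} p(x_i \mid C_r) \|x_i - \bar{x}_r\|^2 , \qquad (11)$$

$$\text{where} \quad \bar{x}_r = \frac{1}{\sum_j p(x_j \in C_r)} \sum_{i=1}^{N} p(x_i \in C_r)\, x_i . \qquad (12)$$

Likewise, one can define a probabilistic equivalent to $qC_1$. As previously, the 
differences between $qM_1^p$ and $qM_2^p$ come from normalization coefficients: considering 
equiprobable data, $p(x_i \mid C_r) = p(C_r \mid x_i) / \sum_j p(C_r \mid x_j)$. 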



3.2 Organization Quality 



The organization criteria can be divided into three groups. The first measure was proposed 
by Cottrell and Fort [12] for one dimensional maps, as the number of inversions, i.e. the 
number of direction changes, which evaluates the line organization. It was generalized 
to higher dimensions by Zrehen and Blayo [13]. 

A second category is based on the data themselves and uses the winning nodes: 
if the map is well organized, then for each data point the two best matching units must be 
adjacent on the grid. This principle has inspired measures such as the topographic error 
[14], the Kaski and Lagus criterion [15], or the Hebbian measure [16]. 

Some organization measures are computed using only the neurons, without the 
data, which leads to an important computational saving and is more independent from 
the learning dataset. They evaluate the correlation between the distance in terms of 
weights and the distance imposed by the grid, that is $dW_{rs} = \|w_r - w_s\|^2$ and $dG_{rs} = 
\|z_r - z_s\|^2$. Indeed, the aim of organization is that the nearer the nodes, the higher their 
similarity in terms of weight vectors. Bauer and Pawelzik [17] evaluate the conservation 
of ordering between nodes sorted by dW or dG. Flexer [18] evaluates the organization 
by a measure of correlation on the distance matrices. It only considers the map itself, 
without taking into account the data, whose role is reduced to the training phase. Denoting, 
for any K x K matrix A, $\Sigma_A = \sum_{i,j} A_{ij}$ and $N_A = \Sigma_{A^2} - (\Sigma_A / K)^2$ (products 
taken entrywise), he uses the Pearson correlation: 

$$\rho = \frac{\Sigma_{dG \cdot dW} - \Sigma_{dG}\, \Sigma_{dW} / K^2}{\sqrt{N_{dG}\, N_{dW}}} \qquad (13)$$
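
The correlation between grid distances and weight distances can be computed from the node positions and weight vectors alone. The sketch below uses squared Euclidean distances for both matrices and takes all products entrywise; `organization_correlation` is a hypothetical name and the three-node map is a toy example.

```python
import math

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def organization_correlation(Z, W):
    """Pearson correlation between grid and weight distance matrices."""
    K = len(Z)
    dG = [[sqdist(Z[r], Z[s]) for s in range(K)] for r in range(K)]
    dW = [[sqdist(W[r], W[s]) for s in range(K)] for r in range(K)]

    def total(A):                       # sum of all entries
        return sum(sum(row) for row in A)

    def entrywise(A, B):                # entrywise product
        return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

    def spread(A):                      # N_A = Sigma_{A^2} - (Sigma_A / K)^2
        return total(entrywise(A, A)) - (total(A) / K) ** 2

    num = total(entrywise(dG, dW)) - total(dG) * total(dW) / K ** 2
    return num / math.sqrt(spread(dG) * spread(dW))

Z = [(0,), (1,), (2,)]                  # node positions on a 1-D grid
W = [(0.0,), (0.5,), (1.0,)]            # weights ordered like the grid
rho = organization_correlation(Z, W)
```

Here the weights are a scaled copy of the grid positions, so the two distance matrices are proportional and the correlation equals 1 up to rounding.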
3.3 Combination 

The previous measures do not consider the double objective of topographic clustering, 
but only one of its aspects; only two measures evaluate the quality of the compromise. 

In the case of probabilistic formalizations, the result can be evaluated by the 
penalized likelihood of validation data. This measure evaluates the two objectives as it 
integrates the probability distribution on the weights, which expresses the constraint. 

Independently of the learning algorithm, one can use as criterion a weighted 
quantization error $q_w = E$, where E is the function (3), whose decomposition (4) shows that it 
considers both clustering and organization. Yet, it does not make it possible to select an optimal 
$\sigma_h$ value if the h matrix is not normalized: when $\sigma_h$ is small, most $h_{r g(x_i)}$ terms are low; 
it appears that when $\sigma_h$ increases, the growth in the number of summed terms 
entails an increase that outweighs the decrease in cost due to a better organization. 
Thus, $q_w$ increases without reflecting a real deterioration of the map's quality. 



4 Proposed Criterion 

To evaluate globally a topographic map, we propose a criterion combining a clustering 
quality measure with an organization measure, which we extend to kernel-based maps. 



4.1 Classic Topographic Clustering 

To measure the clustering quality, we choose a normalized expression $q^{\eta} = q/\eta$ of the 
criteria presented in section 3.1, $q = qC_1^p$, $qM_1^p$, or $qM_2^p$. The normalization aims at 
making the measure independent of the data norm scale. We propose to define: 

$$\eta = \frac{1}{N} \sum_{i=1}^{N} \|x_i - \bar{x}\|^2 \quad \text{with} \quad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i . \qquad (14)$$

If $q = qM_1^p$ or $qC_1^p$, $\eta$ is interpreted as an a priori quantization error, obtained when all 
data are coded by the mean of the dataset. If $q = qM_2^p$, $\eta$ is seen as the variance of the 
dataset before any subdivision. The criterion $q^{\eta} = q/\eta$ constitutes a clustering quality 
measure which varies in the interval [0, 1] and must be minimized. 

As organization measure, we choose a criterion derived from the Pearson correlation

$$c = \frac{1 + \rho}{2} \in [0, 1] . \qquad (15)$$



$q_p$ only depends on the data, c on the weight vectors, which makes them independent. We combine them through the F-measure defined by Van Rijsbergen [3] and classically used in the Information Retrieval field to combine recall and precision. We apply it to c and $1 - q_p$, which are both to be maximized, and define the global criterion $Q_b$

$$Q_b = \frac{(1 + b^2)(1 - q_p)\,c}{b^2 (1 - q_p) + c} \qquad (16)$$

which must be maximized too. b is a weighting parameter controlling the relative importance of the two aims in the evaluation: if b = 2 for instance, $Q_b$ rewards a high organization four times more than a good clustering. Thus this classic measure offers a means to aggregate the two criteria in a single quantity, and provides a numerical value, which always belongs to the interval [0, 1], to compare different maps; its advantage comes from the flexibility provided by the aggregation weight b, which allows the user to define numerically a tradeoff level between the two objectives.
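The combination in (16) can be illustrated with a minimal sketch. It assumes the normalized clustering error and the organization measure c have already been computed and lie in [0, 1]; the function name `global_criterion` and its argument names are illustrative, not taken from the paper.

```python
def global_criterion(q_norm, c, b=2.0):
    """F-measure style combination of clustering and organization quality.

    q_norm: normalized clustering error in [0, 1] (lower is better).
    c:      organization measure in [0, 1] (higher is better).
    b:      weight; b > 1 favors organization, b < 1 favors clustering.
    """
    if not (0.0 <= q_norm <= 1.0 and 0.0 <= c <= 1.0):
        raise ValueError("q_norm and c are expected to lie in [0, 1]")
    clustering = 1.0 - q_norm          # quantity to maximize
    denom = b * b * clustering + c
    if denom == 0.0:                   # both qualities are zero
        return 0.0
    return (1.0 + b * b) * clustering * c / denom


if __name__ == "__main__":
    # Same map scored under the two weightings discussed in the text.
    print(global_criterion(0.57, 0.73, b=2.0))
    print(global_criterion(0.57, 0.73, b=0.5))
```

With b = 1 the expression reduces to the harmonic mean of the two qualities, which is one way to check an implementation.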



4.2 Kernel-Based Topographic Clustering 

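The evaluation of the kernel-based topographic map requires us to compute the previous measure without explicit computations in the feature space. This means evaluating $\|\phi(x_i) - w_r^\phi\|^2$ and $\eta^\phi$; both can be expressed solely with the kernel matrix $k_{ij} = k(x_i, x_j)$: denoting $p_{ir} = p(C_r|x_i)$ and $\alpha_{ir} = p_{ir} / \sum_j p_{jr}$, we have

$$\|\phi(x_i) - w_r^\phi\|^2 = k_{ii} - 2\sum_{j=1}^{N} \alpha_{jr} k_{ij} + \sum_{j,l=1}^{N} \alpha_{jr}\alpha_{lr} k_{jl}$$

$$\eta^\phi = \frac{1}{N}\sum_{i=1}^{N} \|\phi(x_i) - \bar{\phi}\|^2 = \frac{1}{N}\Big(\sum_{i=1}^{N} k_{ii} - \frac{1}{N}\sum_{i,j=1}^{N} k_{ij}\Big), \quad \text{with} \quad \bar{\phi} = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i).$$

Thanks to the normalisation by $\eta^\phi$, a small $q^\phi$ value is not due to the kernel itself, but indicates that the corresponding feature space defines a data encoding which highlights the presence of homogeneous subgroups in the dataset.

The adaptation of c requires computing $dW_{rs} = \|w_r^\phi - w_s^\phi\|^2$. Using the decomposition $w_r^\phi = \sum_i \alpha_{ir}\phi(x_i)$, we have $dW_{rs} = \sum_{i,j=1}^{N} k_{ij}(\alpha_{ir}\alpha_{jr} - 2\alpha_{ir}\alpha_{js} + \alpha_{is}\alpha_{js})$.

The global quality of the map is then computed without significant additional cost as the F-measure between $1 - q^\phi$ and $c^\phi$:

$$Q_b^\phi = \frac{(1 + b^2)(1 - q^\phi)\,c^\phi}{b^2 (1 - q^\phi) + c^\phi} \qquad (17)$$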
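The kernel expansions used in this section can be sketched as follows. This is only an illustration under the assumption that each center is the weighted mean of the mapped data, with weights summing to one per node; the helper names (`sq_dist_to_center`, `feature_space_eta`, `linear_gram`) are hypothetical, and a plain dot-product Gram matrix is used so that the results can be compared with a direct computation in the input space.

```python
def linear_gram(points):
    """Gram matrix of the plain dot product (illustrative kernel choice)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def sq_dist_to_center(gram, alpha_r, i):
    """Squared feature-space distance between point i and the center of node r.

    Assumes the center is sum_j alpha_r[j] * phi(x_j); only kernel values
    are used, never the feature map itself.
    """
    n = len(gram)
    cross = sum(alpha_r[j] * gram[i][j] for j in range(n))
    center_norm = sum(alpha_r[j] * alpha_r[l] * gram[j][l]
                      for j in range(n) for l in range(n))
    return gram[i][i] - 2.0 * cross + center_norm


def feature_space_eta(gram):
    """Mean squared distance of the mapped data to their feature-space mean."""
    n = len(gram)
    trace = sum(gram[i][i] for i in range(n))
    total = sum(sum(row) for row in gram)
    return (trace - total / n) / n


if __name__ == "__main__":
    pts = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
    gram = linear_gram(pts)
    # Center halfway between the first two points, i.e. (1, 0) in input space.
    print(sq_dist_to_center(gram, [0.5, 0.5, 0.0], 2))
    print(feature_space_eta(gram))
```

Replacing `linear_gram` by any other positive semi-definite kernel matrix leaves the two other functions unchanged, which is the point of the kernel formulation.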



5 Numerical Experiments 

The numerical experiments highlight the relevance of the proposed criterion for map evaluation and for model selection, including the algorithm hyperparameters (grid size K and neighborhood parameter $\sigma_h$), the kernel hyperparameters (type and parameter) and the data encoding. They are based on the STMK algorithm applied to a 2D square map. Indeed, STMK contains the classic maps as a special case, using the linear kernel $k(x, y) = (x \cdot y)/d$, which is equivalent to the scalar product in the input space.



5.1 Criterion Validation and Hyperparameter Selection 

We study the behavior of the proposed criterion on an artificial 2D database, varying the hyperparameters. The base is generated by two distributions: a gaussian centered along a parabolic curve, and an isotropic gaussian (see fig. 1). As the data belong to $\mathbb{R}^2$, the resulting clusters can be displayed visually, by representing data belonging to the same node with the same symbol; the organization is represented by joining the means (computed in $\mathbb{R}^2$) of non-empty clusters corresponding to adjacent nodes (in the kernel-based case, the centers $w_r^\phi$ belong to the feature space $\mathcal{F}$, and cannot be represented).

Fig. 1. Maps obtained with the optimal hyperparameter values. 1. Best linear map for $Q_{0.5}$, $(K, \sigma_h) = (49, 0.30)$. 2. Best linear map for $Q_2$, $(K, \sigma_h) = (16, 0.28)$. 3. Best 4x4 gaussian map for $Q_2$, $(\sigma_h, \sigma_k) = (0.28, 1)$. 4. Best 4x4 polynomial map for $Q_2$, $(\sigma_h, m_k) = (0.28, 2)$.

Fig. 2. Variation of the clustering criteria $qC_1$ and $qM_1$ on the left, $qM_2$ on the right, as functions of the neighborhood parameter $\sigma_h$ for various grid sizes $K = \kappa^2$. Caption: × ⇔ κ = 3, ○ ⇔ κ = 4, * ⇔ κ = 5, □ ⇔ κ = 6, ◇ ⇔ κ = 7.

Figure 2 represents, for the linear kernel, the evolution of the clustering criteria $qC_1$ and $qM_1$ (left), and $qM_2$ (right), as functions of $\sigma_h$ for various K values. All are monotonic functions of K and $\sigma_h$: the clustering quality is higher if the number of clusters is high and the organization constraint is low; thus they are not sufficient to select optimal parameter values. One can notice that the difference between $qC_1$ and $qM_1$ remains low; yet, as $qM_1$ evaluates the clustering quality using only the identified data subgroups, it appears more satisfying on an interpretation level. Lastly, the range of $qM_2$ (right graph) is smaller than that of $qC_1$ and $qM_1$: it is less discriminant and thus less useful to compare maps. In the following, we shall keep the $qM_1$ measure.

Figure 3 represents the evolution of $Q_{0.5}$ (left) and $Q_2$ (middle), for a linear kernel. They are not monotonic and indicate optimal values, which depend on b. We chose to test b = 0.5 and b = 2 to favor each objective in turn: for b = 0.5, clustering is considered as the main task, thus the optimal K value is the highest tested value, K = 49; if b = 2, the organization demand is stronger, large grids which are difficult to organize obtain a lower score, and the optimum is K = 16.

Fig. 3. Variations of $Q_{0.5}$ on the left, $Q_2$ in the middle, as functions of $\sigma_h$; on the right, “trajectory” of the map in the $(1 - q, c)$ plane when $\sigma_h$ varies; for different grid sizes $K = \kappa^2$. Caption: × ⇔ κ = 3, ○ ⇔ κ = 4, * ⇔ κ = 5, □ ⇔ κ = 6, ◇ ⇔ κ = 7.

The right graph of fig. 3 shows the “trajectory” of the map in the $(1 - q, c)$ plane, when $\sigma_h$ varies, for different grid sizes, and highlights the influence of $\sigma_h$: small $\sigma_h$ values lead to a high quality clustering, but a poor organization. The parameter b, which expresses the relative importance of the two objectives in the evaluation phase, enables the user to define the desired tradeoff level: the points denoted A and B correspond to the optima associated with $Q_{0.5}$ and $Q_2$ and represent two different compromises. From our experiments, it seems that b = 2 is a good compromise in a visualisation framework.

Graphs 1 and 2 of fig. 1 show the maps obtained with the optimal $(K, \sigma_h)$ values for $Q_{0.5}$ and $Q_2$ respectively. For b = 0.5, the clusters have low variances, but their organization is not satisfying: the chain of clusters associated with the parabolic data is too sensitive to data. For b = 2, it is more regular, and the map distinguishes between the two generative sources; it reflects the internal structure of the data, in particular their symmetry. In the following, we keep the value b = 2, which better reflects the visualisation goal, and consider 4x4 maps.

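We tested STMK with the polynomial kernel $k_p$ and the gaussian kernel $k_g$

$$k_p(x, y) = \left(\frac{x \cdot y}{d} + 1\right)^{m_k} \qquad k_g(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma_k^2}\right)$$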

Figure 4 represents $Q_2$ as a function of $\sigma_h$ for various values of $\sigma_k$ (resp. $m_k$), for $k_g$ (resp. $k_p$). It shows that a large $\sigma_h$ range leads to similar results. It also indicates that the gaussian kernel outperforms the linear one: the optimal $Q_2$ value is 0.941 in the gaussian case, and 0.917 in the linear case. The associated graphs (2 and 3, fig. 1) are very similar; the evaluation difference has two causes: the slight assignment differences are in favor of the gaussian kernel; moreover, even identical clusters appear more compact in the feature space than in the input space and lead to a better score. This is justified, as the higher compactness leads to a faster convergence (5.3 times faster with these parameter values). According to $Q_2$, the polynomial kernel gives poor results, which is confirmed by the graphical representation (graph 4, fig. 1): the optimal polynomial map distinguishes the two sources but lacks organization.

This artificial base highlights the fact that the quality criterion based on the F-measure makes it possible to select the hyperparameter values that indeed correspond to optimal maps, by both rewarding good maps and penalizing bad ones. It also highlights the relevance of kernels in solving a topographic clustering problem.

Fig. 4. Variations of $Q_2$ for gaussian (left) and polynomial (right) kernels, as functions of $\sigma_h$. ○ corresponds to the linear map; right, × to $\sigma_k = 0.1$, + to $\sigma_k = 0.5$, * to $\sigma_k = 1$, □ to $\sigma_k = 1.5$, ◇ to $\sigma_k = 1.7$, △ to $\sigma_k = 2$; left, × to $m_k = 2$, + to $m_k = 3$, * to $m_k = 4$, □ to $m_k = 5$.

Table 2. Best gaussian parameter combinations for various databases. The correspondence with the newsgroups is: 1 = alt.atheism, 2 = comp.graphics, 3 = rec.autos, 4 = rec.sport.hockey, 5 = sci.crypt, 6 = sci.electronics, 7 = soc.religion.christian, 8 = talk.politics.guns.

                      tfidf encoding (500 attributes)                   mppca encoding (20 attributes)
Dataset  content      $\sigma_h$  $\sigma_k$  $1-q$  $c$   $Q_2$        $\sigma_h$  $\sigma_k$  $1-q$  $c$   $Q_2$
D1       2, 3, 5, 8   0.14        2           0.43   0.73  0.645        0.18        0.5         0.66   0.79  0.761
D2       1, 2, 6, 8   0.14        1.5         0.36   0.72  0.601        0.22        1           0.69   0.78  0.762
D3       3, 4, 6, 7   0.14        1.7         0.32   0.74  0.582        0.24        1.5         0.69   0.79  0.769

5.2 Data Encoding Comparison 

We applied the proposed criterion to compare two document encodings: the tfidf method and a semantic-based representation, called mppca, proposed by Siolas and d’Alché-Buc [19]. The latter exploits, through Fisher score extraction, a generative document model combined with a generative word model which captures semantic relationships between words thanks to a mixture of probabilistic PCAs. The dataset is built from the 20 newsgroup database¹ by selecting 100 texts from each of four different newsgroups. These 400 documents are encoded either by a 20 PCA mixture learnt on a 4x200-text set, or by the tfidf also learnt on this set, with a 500-word vocabulary. Table 2 presents the characteristics of the best 7x7 gaussian maps. It shows the relevance of the semantic-based model for the topographic clustering task: it leads to far better results, both globally and individually on clustering and organization. These tests on an unsupervised learning task confirm the results obtained in a supervised framework [19].

6 Conclusion 

We have presented topographic clustering algorithms, from the original formulation of Self Organizing Maps by Kohonen to the Soft Topographic Mapping with Kernel extension, which makes it possible to use kernel functions in the same context of constrained clustering, and considered the problem of map evaluation. We defined a new criterion which flexibly combines, through an F-measure, a clustering quality criterion with an organization criterion. The numerical experiments show that it constitutes an efficient map comparison tool and makes hyperparameter selection possible. Its main advantage lies in its flexibility, which makes it possible for the user to explicitly define the tradeoff level between the two contradictory objectives of self organizing maps; thus it adapts itself to the user’s demands.

The next step of our work consists in applying bootstrap or another robust statistical estimation method to the proposed evaluation measure. The perspectives also include the application of the criterion to micro-array data, where visualization is at the heart of the problem and where such a criterion would make it possible to select the best maps objectively.

¹ http://www.ai.mit.edu/people/jrennie/20Newsgroups/
Abstract. Text classification, whether by topic or genre, is an important task that contributes to text extraction, retrieval, summarization and question answering. In this paper we present a new pairwise ensemble approach, which uses pairwise Support Vector Machine (SVM) classifiers as base classifiers and an “input-dependent latent variable” method for model combination. This new approach better captures the characteristics of genre classification, including its heterogeneous nature. Our experiments on two multi-genre collections and one topic-based classification dataset show that the pairwise ensemble method outperforms both boosting, which has been demonstrated to be a powerful ensemble approach, and Error-Correcting Output Codes (ECOC), which applies pairwise-like classifiers to multiclass classification problems.



1 Introduction 

Text classification, the problem of assigning documents to predefined categories, 
is an active research area in both information retrieval and machine learning. 
It plays an important role in information extraction and summarization, text 
retrieval, and question-answering. In general, text classification includes topic- 
based text classification and text genre-based classification. Topic-based text 
categorization, which is classifying documents according to their topics, has been 
intensively studied before [24,26]. However, texts can also be written in many 
genres, for instance: scientific articles, news reports, movie reviews, and adver- 
tisements. Genre is defined on the way a text was created, the way it was edited 
and published, the register of language it uses, and the kind of audience to whom 
it is addressed [16]. 

Previous work on genre classification recognized that this task differs from topic-based categorization [16,7]. A single genre, such as “written newswire”, may encompass a range of topics, e.g., sports, politics, crime, technology, economy and international events. On the other hand, many articles on the same topic can be written in different genres. Therefore, the genre-topic mapping is many to many. Genre collections, such as ours discussed later, contain different genres covering the same topics (newswire, radio news and TV news), in order to evaluate automated genre classification independently of topic classification. One way in which these two classification problems differ is that genre classification seldom exhibits individual features that highly predict a category, unlike topic classification, where words such as “umpire” and “RBI” directly predict the “baseball” category and indirectly the “sports” category.

Given the task of genre classification, the next questions are: How can we 
build accurate methods according to the characteristics of the genre data? Can 
we partially reuse the extensive body of work on topic classification? This paper 
explores aspects of these questions. 

There have been many attempts to extract meaningful linguistic features to improve the prediction accuracy, such as POS tagging, parsing, punctuation counts, and layout features. However, many of those features (such as POS tagging or parsing) require high computational costs for little performance improvement; furthermore, for some text sources such as video, capitalization, punctuation and other such information are lost in the automatically speech-recognized transcript of the audio stream. Therefore it is useful to address genre classification using “bag of words” features only, the same features used for topic-based classification. Thus, instead of extracting other potential features, we focus on identifying the characteristics of the data in genre classification and propose suitable learning models accordingly.

Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles, even for documents within one genre. In other words, the data are heterogeneous. To illustrate this point, we provide excerpts of two documents from the same genre, “bulletin-board”, in our collected corpus:

— Example-1: GSA announces weekly Happy Hours! Where: Skibo Coffeehouse 
When: Friday’s 5-7pm What: Beer, Soda and Pizza Why: A chance to meet 
graduate students from all across campus. See you this Friday! 

— Example-2: Hi guys, I don’t know whether there is an informal party or not 
although different people kept saying there might be one... So if there is 
nothing, we can go to Cozumel tonight cuz there will be a live Latin band 
tonight starting at 9:30pm. But if there is anything else, then let me know. 

Heterogeneity is an important property shared by many other problems, such as scene classification and handwritten digit recognition. However, typical studies in topic-based classification assume homogeneous data and tight distributions¹. Extending classification to high-variance heterogeneous data is an interesting topic that has not been investigated, and is the primary focus of this paper.

Since the data are acquired from different sources and are thus rather heterogeneous, a single classification model might not be able to explain all the training data accurately. One apparent solution to this problem is to divide the heterogeneous data into a set of relatively homogeneous partitions, train a classification model over each partition and combine the predictions of the individual models. In this way, each sub-model captures only one aspect of the decision boundary. The idea of creating multiple models on the training data and combining the predictions of each model is essentially the ensemble approach, and there have been many studies on this subject. Several ensemble approaches have been successfully applied to text classification tasks, including boosting [8], Error-Correcting Output Codes (ECOC) [6], hierarchical mixture model [22] and automated survey coding [12]. Alternative approaches such as stacking [23] and earlier meta-classifier approaches [2] do not partition the data, but rather combine classifiers each of which attempts to classify all data over the entire category space.

¹ The primary exception is Topic Detection and Tracking (TDT), where multiple news sources are tracked in an on-line categorization task [1,25].

In this paper, we examine different ensemble methods for text classification. In particular, we propose an “input-dependent latent variable” approach for model combination, which automatically directs each test example to the most appropriate classification model within the ensemble. We use this method as the framework to solve genre classification problems. Although our discussion focuses on the multi-class classification framework, it is not difficult to extend it to multi-label classification problems. The rest of the paper is organized as follows: in Section 2 we give an in-depth discussion of the popular ensemble approaches for topic-based text classification. Then we present our pairwise ensemble approach in Section 3. In Section 4 we compare our method with other ensemble methods on four datasets, including one artificial dataset, two genre datasets and one topic-based classification dataset. Finally, we conclude and hint at future work.

2 Popular Ensemble Approaches for Text Classification 

Generally speaking, an ensemble approach involves two stages, namely model 
generation and model combination. In this section, we examine the model gen- 
eration and model combination strategies in the popular ensemble approaches for 
the topic-based classification. Since genre classification also uses “bag of words” 
features, hopefully we can reuse some of the successful learning methods from 
topic classification to help genre classification. 

Bagging involves a “bootstrap” procedure for model generation: each model is generated over a subset of the training examples using random sampling with replacement (the sample size is equal to the size of the original training set). From a statistical point of view, this procedure asymptotically approximates the models sampled from the Bayesian posterior distribution. The model combination strategy for bagging is majority vote. Simple as it is, this strategy can reduce variance when combined with model generation strategies. Previous studies on bagging have shown that it is effective in reducing classification errors [4].
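The two stages of bagging described above can be sketched in a few lines. This is a minimal illustration, not the setup used in the experiments: the base learner (`fit_nearest_mean`, a one-dimensional nearest-class-mean rule) and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def bootstrap_sample(examples, rng):
    """Draw len(examples) items uniformly at random, with replacement."""
    n = len(examples)
    return [examples[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Return the most frequent label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]


def fit_nearest_mean(sample):
    """Toy base learner (assumption): classify by the closest class mean."""
    sums, counts = {}, {}
    for value, label in sample:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bagging_predict(train, x, fit, n_models=11, seed=0):
    """Model generation by bootstrap, model combination by majority vote."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(train, rng)) for _ in range(n_models)]
    return majority_vote([model(x) for model in models])


if __name__ == "__main__":
    train = [(0.0, "a"), (0.2, "a"), (0.1, "a"),
             (1.0, "b"), (0.9, "b"), (1.1, "b")]
    print(bagging_predict(train, 0.05, fit_nearest_mean))
    print(bagging_predict(train, 1.05, fit_nearest_mean))
```

Any function that maps a training sample to a predictor can be passed as `fit`, so the same skeleton applies to other base classifiers.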

Boosting. As a general approach to improving the effectiveness of learning, boosting [8] has been the subject of both theoretical analysis and practical applications. Unlike bagging, in which each model is generated independently, boosting forces the base classifier to focus on the examples misclassified in previous iterations. In this way, each new model can compensate for the weakness of previous models and thus correct the inductive bias gradually [17]. Applying boosting to text categorization tasks, Schapire and Singer evaluated AdaBoost on the benchmark corpus of Reuters news stories and obtained results comparable to Support Vector Machines and k-Nearest Neighbor methods [21], which are among the top classifiers in text classification evaluations [24,14]. Empirical studies on boosting and bagging show that while both approaches can substantially improve accuracy, boosting exhibits greater benefits [19,9]. Therefore, we provide only the results of boosting in our comparative experiments.

ECOC is an ensemble approach for solving multiclass categorization problems originally introduced by Dietterich and Bakiri [6]. It reduces a k-class classification problem into L (L < k) binary classification problems and combines the predictions of those L classifiers using the nearest codeword (for example, by Hamming distance). The code matrix R (a k × L matrix) defines how each sub-model is generated. Many code matrices have been proposed, such as the Dense matrix and BCH codes [20]. Recent work has demonstrated that ECOC offers improvement over the standard one-against-all method in text classification and provided theoretical evidence for the use of random codes [3,11].
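Nearest-codeword decoding can be illustrated with a small sketch. The code matrix below is an assumption made for the example (an 8-class, 7-bit code in which any two rows differ in exactly four positions); it is not one of the matrices cited above, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(code_matrix, predicted_bits):
    """Return the index of the class whose codeword is nearest to the bits."""
    return min(range(len(code_matrix)),
               key=lambda c: hamming(code_matrix[c], predicted_bits))


def example_code_matrix():
    """Illustrative 8 x 7 code matrix: bit j of class c is the parity of c & j.

    Each column defines one binary problem (a 4-versus-4 split of the classes).
    """
    return [[bin(c & j).count("1") % 2 for j in range(1, 8)] for c in range(8)]


if __name__ == "__main__":
    R = example_code_matrix()
    bits = list(R[5])
    bits[2] ^= 1                      # one binary classifier makes an error
    print(ecoc_decode(R, bits))       # the single error is corrected
```

Because every pair of codewords in this example matrix is at distance four, one erroneous binary prediction still leaves the true class as the unique nearest codeword.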



3 Pairwise Ensemble Approaches 

From the discussion in section 2, we can see that most of those methods have complex model generation procedures and demonstrate considerable empirical improvement. However, they may not be the best choices for classification problems with heterogeneous data for two reasons: 1) In order to capture the heterogeneous characteristics of the data, it would be desirable to divide the training data into several relatively homogeneous subsets. However, most algorithms do not intentionally do so. 2) The combination strategies are rather simple. To better solve heterogeneous classification problems, we propose the pairwise ensemble approach. The key ideas of our algorithm are:

— build pairwise classifiers to intentionally divide the training data into rela- 
tively less heterogeneous sets so that each base classifier focuses on only one 
aspect of the decision boundary; 

— combine the results using the “input-dependent latent variable” approach, 
which can consider the particular properties of each testing example and 
dynamically determine the appropriateness of each base classifiers. 

3.1 Model Generation by Pairwise Classification 

Since our data are quite heterogeneous, they present difficulties for the classical one-against-all method, as suggested by our experimental results in section 4. A natural idea is to apply the pairwise classification method to discover the exact difference between each pair of genres and then combine the predictions of the individual classifiers. One big advantage of this approach is that each sub-classifier need only capture one local aspect of the training data, while in the single model approach it has to fit all the aspects of the entire training data, which can average out important local distinctions.

Building pairwise classifiers for multi-class classification problems is not a new idea, and many attempts have been made to build ensemble approaches, such as ECOC [6], pairwise coupling [13], and round robin ensemble [10]. However, there has been little prior work on automatically combining individual pairwise classifier results in a meaningful way.

3.2 A General “Latent Variable” Approach for Combining Models 

After the pairwise classifiers have been built, the remaining problem is how to combine the results. Linear combination methods, such as weighted voting, are inappropriate for pairwise classification because each individual classifier only captures local information. One sub-classifier may be good for some examples, but not necessarily for all the testing data. Thus, a better strategy is to build a set of “gates” on top of the individual models and ask each “gate” to tell whether the corresponding model is good at capturing the particular patterns of the input test data. We call this the “input-dependent latent variable” approach because those gates can be thought of as latent variables that determine the right models for each input. Next, we give a formal description of this strategy.

Given the input data x and a set of ensemble models $M = \{m_1, m_2, \ldots, m_n\}$, our goal is to compute the posterior probability $P(y|x, M)$. As shown in Figure 1, each gate, i.e. hidden variable, is responsible for choosing whether its corresponding classifier should be used to classify the input pattern. More precisely, let $h_i$ stand for the hidden variable corresponding to the ith classification model; the value of $h_i$ can be 1 or 0, with 1 representing that the ith model is selected for classifying the input example and 0 otherwise. By using the hidden variables, we can expand the posterior probability as a sum as follows:

$$P(y|x, M) = \sum_{k_i \in \{0,1\}} P(y, h_1 = k_1, h_2 = k_2, \ldots, h_n = k_n | x, M).$$

By assuming that the selection of a classification model is independent from the selection of another, we can simplify the joint probability as follows:

$$P(y, h_1 = k_1, h_2 = k_2, \ldots, h_n = k_n | x, M) = \prod_{i=1}^{n} P(h_i = k_i | x, M) \times P(y | h_1 = k_1, \ldots, h_n = k_n, x, M).$$

Consider building an exponential model with a set of features $\{\ln P(y|x, m_1), \ldots, \ln P(y|x, m_n)\}$; then $P(y | h_1 = k_1, \ldots, h_n = k_n, x, M) = \frac{1}{Z}\exp \sum_i \alpha_i \ln P(y|x, m_i)$. To incorporate $h_i$ into the equation above, we set $\alpha_i$ to $k_i$, with the intuition that the prediction of a model is given high weight if the model is suitable for the input pattern and low weight otherwise. In this way, we get

$$P(y | h_1 = k_1, \ldots, h_n = k_n, x, M) = \frac{1}{Z}\prod_{i=1}^{n} P^{k_i}(y|x, m_i). \qquad (1)$$

We drop the normalization factor Z, rewriting the previous equation as a proportionality, and then the joint probability can be derived as follows:

$$P(y, h_1 = k_1, h_2 = k_2, \ldots, h_n = k_n | x, M) \propto \prod_{i=1}^{n} P(h_i = k_i | x, M)\, P^{k_i}(y|x, m_i).$$
Fig. 1. The Structure of the Latent Variable Approach.

Therefore

$$P(y|x, M) \propto \sum_{k_i \in \{0,1\}} \prod_{i=1}^{n} P(h_i = k_i | x, M)\, P^{k_i}(y|x, m_i)$$

$$= \prod_{i=1}^{n} \sum_{k_i \in \{0,1\}} P(h_i = k_i | x, M)\, P^{k_i}(y|x, m_i)$$

$$\propto \prod_{i=1}^{n} \{P(h_i = 1 | x, M)\, P(y|x, m_i) + P(h_i = 0 | x, M)\}.$$

By assuming $P(h_i = 0 | x, M) \to 1$ we have

$$P(y|x, M) \propto \prod_{i=1}^{n} \{P(h_i = 1 | x, M)\, P(y|x, m_i) + 1\}.$$

In this way we can further simplify by expanding only to the first order and ignoring the high order terms, which usually express the interaction between different models and are usually very small in value². At last, we get the approximation:

$$P(y|x, M) \propto \sum_{i=1}^{n} P(h_i = 1 | x, M)\, P(y|x, m_i) \qquad (2)$$

As indicated in (2), there are two major components: $P(h_i = 1|x, M)$, i.e. the component describing how likely it is that the ith classifier should be used for classifying the input example, and $P(y|x, m_i)$, i.e. the component determining the likelihood that class y is the true class label given the input x and the classification model $m_i$. At first glance, (2) looks very similar to the linear combination strategies, except that the combination factor is $P(h_i = 1|x, M)$. However, unlike the linear combination strategies, whose combination weights are the same for all inputs, the weights in the latent variable approach are strongly connected with the input example through the conditional probability $P(h_i = 1|x, M)$.
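The approximation (2) can be written as a short function. The sketch below assumes the gate probabilities and the per-model class posteriors for a given input have already been estimated; the function name and the dictionary-based representation are illustrative choices, and the final renormalization is added only to make the scores comparable across inputs.

```python
def latent_variable_combination(gate_probs, model_probs):
    """Combine model posteriors with input-dependent gate weights, as in (2).

    gate_probs[i]:     P(h_i = 1 | x, M) for the current input x.
    model_probs[i][y]: P(y | x, m_i) for the same input.
    Returns scores proportional to P(y | x, M), renormalized to sum to one.
    """
    labels = sorted({y for probs in model_probs for y in probs})
    scores = {y: 0.0 for y in labels}
    for gate, probs in zip(gate_probs, model_probs):
        for y, p in probs.items():
            scores[y] += gate * p
    total = sum(scores.values())
    if total > 0.0:
        scores = {y: s / total for y, s in scores.items()}
    return scores


if __name__ == "__main__":
    posteriors = [{"a": 0.2, "b": 0.8}, {"a": 0.9, "b": 0.1}]
    # The gates trust the first model for this particular input.
    print(latent_variable_combination([0.9, 0.1], posteriors))
    # Fixed, input-independent weights would give a different ranking.
    print(latent_variable_combination([0.5, 0.5], posteriors))
```

The second call mimics a linear combination with constant weights and ranks class "a" first, whereas the gated version ranks "b" first, which illustrates the difference discussed in the paragraph above.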

² During the derivation, we have made two assumptions ((1) and this one). In section 4 we will show that our approach demonstrates significant improvement over other methods even with those simplifying assumptions.
Fig. 2. The Structure of the Pairwise Ensemble Approach for Class $C_i$.



3.3 “Latent Variable” Approach for Pairwise Classifier 

Given the latent variable approach as the framework, the remaining problem is to estimate the conditional probability $P(h_i = 1|x, M)$, i.e. the likelihood that the ith model should be used for class prediction given the input x. Since each individual classifier is a pairwise classifier that differentiates two classes, say $C_i$ and $C_j$, a simple method to estimate $P(h_i = 1|x, M)$ is to build a binary classifier on top of each base classifier to separate examples that belong to these two classes ($C_i$ and $C_j$) from those that do not. The underlying idea is that the likelihood for a model to be used for classifying an input datum x is equal to the likelihood that x is similar to the examples used to train the model.

To make this clearer, let n be the number of classes and let $m_{ij}$ represent the pairwise classifier that differentiates classes $C_i$ and $C_j$. On the top level, we have another classifier $M_{ij}$ that determines whether the examples belong to one of the classes $C_i$, $C_j$ or not. Figure 2 shows the structure of the model for class $C_i$. Compared with (2), for the pairwise ensemble approach

$$P(y = C_i | x, m_{ij}) = P(y = C_i | x, y \in \{C_i, C_j\}, m_{ij}),$$

$$P(h_{ij} = 1 | x, M_{ij}) = P(y \in \{C_i, C_j\} | x, M_{ij}).$$

For each class $C_i$, we build a structure like this and compute the corresponding score of the test examples. For multiclass problems, typical in genre classification, we can assign the test example to either the class label with the highest score (R-cut) or the classes whose scores are above some preset thresholds (S-cut) [26]. In this way, our approach can be extended to multi-label classification problems by assigning all the class labels above a given threshold to each test instance.
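The scoring and decision steps described in this section can be sketched as follows. The probabilities are taken as given inputs (in the paper they come from trained classifiers); the function names, the dictionary layout keyed by class pairs, and the numeric values in the usage example are all illustrative assumptions.

```python
from itertools import combinations


def pairwise_ensemble_scores(classes, pair_prob, gate_prob):
    """Score each class by summing gated pairwise predictions.

    pair_prob[(a, b)]: P(y = a | x, y in {a, b}) from the pairwise classifier.
    gate_prob[(a, b)]: P(y in {a, b} | x) from the corresponding gate.
    Keys follow the order produced by combinations(classes, 2).
    """
    scores = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        gate = gate_prob[(a, b)]
        p_a = pair_prob[(a, b)]
        scores[a] += gate * p_a
        scores[b] += gate * (1.0 - p_a)
    return scores


def r_cut(scores):
    """Single-label decision: the class with the highest score."""
    return max(scores, key=scores.get)


def s_cut(scores, threshold):
    """Multi-label decision: every class whose score reaches the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


if __name__ == "__main__":
    classes = ["faq", "news", "review"]
    pair_prob = {("faq", "news"): 0.1, ("faq", "review"): 0.5,
                 ("news", "review"): 0.8}
    gate_prob = {("faq", "news"): 0.6, ("faq", "review"): 0.1,
                 ("news", "review"): 0.9}
    scores = pairwise_ensemble_scores(classes, pair_prob, gate_prob)
    print(scores, r_cut(scores), s_cut(scores, 0.2))
```

In this toy input the gate for the (faq, review) pair is low, so the uninformative 0.5 prediction of that pairwise classifier contributes little to either class.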

3.4 Related Work and Discussion 

Our work is related to several approaches, including hierarchical mixture of experts (HME) [15] and pairwise coupling [13]. HME uses similar ideas to dynamically determine the most appropriate model for testing examples. However, it requires much higher computational costs because it applies an EM algorithm to estimate the latent variables. Pairwise coupling also incurs high costs in the test phase due to its iterative search procedures. Therefore it would be difficult to directly apply those two methods to text classification problems. In order to provide a rough idea of the efficiency and effectiveness of those ensemble methods, we followed the experiments in [13] and generated a synthetic dataset of 3 classes with the data in each class generated from a mixture of Gaussians. We use Linear Discriminant Analysis (LDA) as the base classifier. The results and the decision boundaries are shown in Figure 3. From the results, we can see that pairwise ensemble and pairwise coupling are the best in terms of accuracy, and that our method is much faster.

Fig. 3. Comparison of time and accuracy of Different Ensemble Approaches. Subtitle lists method, accuracy and estimated running time. C1 → * (black), C2 → □ (blue), C3 → ○ (red).

4 Empirical Validation 

In our experiments we chose two datasets for the evaluation of genre-based
classification. Collection I consists of 12,259 documents from 10 genres (for
details see Table 1). We split the corpus into a training set of 9,236 documents
and a test set of 3,023 documents. The Radio-news, TV-news and Newswire documents
are part of the TDT2 corpus [1]; we extracted documents from the same time period
in order to ensure similar content and thus minimize the information provided by
different topics rather than different genres. The rest of the documents were
collected from the web. Collection II, provided by Nigel Dewdney, consists of
about 3,950 documents from 17 genres (see Table 2 for details). We split the
corpus into a training set of 3,000 documents and a test set of 950 documents.
Compared with Collection I, Collection II contains more genres, but they are
easier to distinguish.
Table 1. Document Distribution in Collection I.

Genre            Number
Newswire         2DS2 [sic]
Radio            TSTTT [sic]
[illegible]      1145
Message          11U6 [sic]
[illegible]      Togr [sic]
FAQ              TDB3 [sic]
Politics         999
Bulletin         998
Review           996
Search Result    969

Table 2. Document Distribution in Collection II.

Genre         Number
Jokes         315
Recipe        302
Quotes        293
Tips          270
Newspages     252
Advice        243
Poetry        231
Horoscopes    223
Conference    211
Resume        203
Company       202
Personal      200
Interview     201
Article       201
Search        201
Homepages     202
Classified    200

Table 3. Comparison of Results for Collection I.

Method                     Micro-Avg F1       Macro-Avg F1       Error Rate
One-against-all + SVM      0.8757 (n/a)       0.8780 (n/a)       11.6%
Pairwise Ensemble + SVM    0.8965 (+2.3%)     0.9003 (+2.5%)     10.4%
Boosting + SVM             0.8695 (-0.7%)     0.8726 (-0.6%)     12.4%
ECOC + SVM                 0.8720 (-0.4%)     0.8758 (-0.3%)     12.8%

Table 4. Comparison of Results for Collection II.

Method                     Micro-Avg F1       Macro-Avg F1       Error Rate
One-against-all + SVM      0.9013 (n/a)       0.8755 (n/a)       9.1%
Pairwise Ensemble + SVM    0.9495 (+5.3%)     0.9432 (+7.7%)     5.1%
Boosting + SVM             0.8903 (-1.2%)     0.8620 (-1.5%)     9.9%
ECOC + SVM                 0.9126 (+1.3%)     0.9026 (+3.1%)     8.7%

We pre-processed the documents by down-casing, tokenization, removal of
punctuation and stop words, stemming, and supervised statistical feature
selection using the criterion. The optimal feature set size was chosen
separately by 10-fold cross-validation. We finally chose 14,000 features
and 10,000 features for Collection I and Collection II, respectively. Document
vectors based on these feature sets were computed using the SMART ltc version
of TF-IDF term weighting [5]. As evaluation metrics, we used the error rate and
a common effectiveness measure, $F_1$, defined as [26]

$$F_1 = \frac{2pr}{p + r},$$

where $F_1$ is the harmonic average of precision $p$ and recall $r$. To measure
overall effectiveness we use both the micro-average (effectiveness computed from
the sum of per-category contingency tables) and the macro-average (unweighted
average of effectiveness across all categories).
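The difference between the two averages can be made concrete with a small sketch. The helper names and the contingency counts below are hypothetical and serve only as an illustration; each category is represented by its true-positive, false-positive and false-negative counts.

```python
def f1(tp, fp, fn):
    """F1 = 2pr / (p + r), rewritten in terms of counts as 2tp / (2tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(tables):
    """tables maps category -> (tp, fp, fn); returns (micro-F1, macro-F1)."""
    # Macro-average: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*counts) for counts in tables.values()) / len(tables)
    # Micro-average: sum the contingency tables first, then compute one F1.
    tp = sum(counts[0] for counts in tables.values())
    fp = sum(counts[1] for counts in tables.values())
    fn = sum(counts[2] for counts in tables.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: one large, easy category and one small, hard category.
tables = {"A": (90, 10, 10), "B": (5, 5, 15)}
micro, macro = micro_macro_f1(tables)
print(round(micro, 4), round(macro, 4))
```

Because the micro-average pools the counts, large categories dominate it, while the macro-average gives every category the same weight; this is why both are reported.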

4.1 Experimental Results 

In our experiments we used Support Vector Machines (SVMs), one of the most
powerful classifiers in previous text classification evaluations [26], relying on
the SVMlight package [14]. Tables 3 and 4 show the results of the different
ensemble approaches and their improvement over the baseline on Collections I and
II, respectively.

As the baseline, we use the result of SVM with a linear kernel, without any
ensemble method. For boosting, we use the AdaBoost algorithm with SVM, tuned
for the optimal number of training iterations, and report the best results (the
corresponding numbers of iterations are 10 and 5 for Collections I and II,
respectively). For ECOC, we use SVM as the base classifier and apply a 63-bit
random coding for both collections, which was also used in the experiments of
[11]. In the pairwise ensemble, SVM is used for both the base classifiers and
the top-level gate classifier.

Table 5. Comparison Results of Three Genres within the Same Topic (F1 measure).

Method                     Newswire    Radio-news    TV-news    Macro-Avg F1
One-against-all + SVM      0.9297      0.7635        0.7572     0.8168
Pairwise Ensemble + SVM    0.9337      0.7838        0.8240     0.8472
Boosting + SVM             0.9327      0.7529        0.7603     0.8153
ECOC + SVM                 0.8894      0.7669        0.8073     0.8212
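To make the ECOC setting concrete, the following is a minimal sketch of a 63-bit random coding with Hamming-distance decoding. It illustrates the general technique only: the function names and the seed are hypothetical, the redrawing of duplicate codewords is our own safeguard, and the 63 binary classifiers (SVMs in the experiments) that would produce the predicted bits are omitted.

```python
import random


def random_code_matrix(n_classes, n_bits=63, seed=0):
    """Draw one random binary codeword per class, redrawing until all differ."""
    rng = random.Random(seed)
    while True:
        codes = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                 for _ in range(n_classes)]
        if len(set(codes)) == n_classes:
            return codes


def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def ecoc_decode(bit_predictions, codes):
    """Return the index of the class whose codeword is nearest to the predictions."""
    return min(range(len(codes)), key=lambda c: hamming(bit_predictions, codes[c]))


codes = random_code_matrix(n_classes=10)
# Each bit position defines one binary problem; here we simply pretend that the
# binary classifiers reproduced the codeword of class 3 without error.
print(ecoc_decode(codes[3], codes))
```

Decoding tolerates up to floor((d - 1) / 2) wrong bits, where d is the smallest Hamming distance between two codewords, which is the motivation for using long codes.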

From the results, we can see that the pairwise ensemble approach consistently
performs best among the four methods in terms of error rate and of both
micro-averaged and macro-averaged F1. Boosting SVM decreased performance
relative to the baseline on both collections, which suggests overfitting. We
also tried boosting other classifiers, such as decision trees and Naive Bayes.
Although boosting SVM degrades performance, it still gives the best result
among the boosted classifiers. ECOC achieved some improvement on Collection II
but decreased performance on Collection I. This suggests that ECOC is not a
generally effective method for improving classification accuracy.

In order to evaluate automated genre classification more clearly, independent
of topic classification, we list in Table 5 the detailed results for three genres
in Collection I, i.e., Newswire, Radio-news and TV-news, which were intentionally
collected on the same topics to minimize the information provided by different
topics rather than different genres. From the results, we can see that the
macro-averaged performance on those three categories (0.8168 for the baseline) is
lower than the average over the whole collection with ten categories (0.8780).
This implies that it is more difficult and challenging to distinguish genres
within the same topic. On the other hand, our approach achieves the best
performance on all three categories, especially for TV-news, where the F1
measure improves by about 9% over the baseline.

4.2 Extension for Topic-Based Text Classification 

We have shown that the pairwise ensemble approach is effective in improving the
performance of genre classification. Since we use only word features for genre
classification, just as in topic-based classification, it is natural to ask
whether our method is also good for topic-based text classification. To answer
this question, we tested our method on the Newsgroups dataset [18], one of the
commonly used datasets for text classification. The dataset contains 19,997
documents evenly distributed across 20 classes. We used the cleaned-up version
of the dataset^ and removed stop words as well as the words that occur only
once, with the final vocabulary size being about 60,000. We randomly selected
80% of the documents per class for training and the remaining 20% for testing
(15199 training documents and 3628 test documents). This is the same
pre-processing and splitting as in the McCallum and Nigam experiments [18].

^ This cleaned-up version was downloaded from
http://www.ai.mit.edu/people/jrennie/ecoc_svm

Table 6. Comparison of Results for 20 Newsgroups.

Method                     Micro-Avg F1       Macro-Avg F1       Error Rate
One-against-all + SVM      0.9009 (n/a)       0.8941 (n/a)       9.3%
Pairwise Ensemble + SVM    0.9333 (+3.6%)     0.9257 (+3.5%)     6.8%
Boosting + SVM             0.9020 (+0.1%)     0.8967 (+0.3%)     9.1%
ECOC + SVM                 0.9159 (+1.7%)     0.9025 (+1.0%)     8.4%

Table 6 lists the results of comparing different ensemble approaches on the
Newsgroups dataset. For boosting, the number of training iterations was set to
10 by cross-validation, and all other parameter settings are the same as in the
previous experiments. The results imply that the pairwise ensemble approach
works well for this text classification dataset, in fact significantly better
than baseline SVM or boosting SVM.

5 Conclusion and Future Work 

In this paper, we identified the heterogeneity of genre data and presented our
new pairwise ensemble approach to capture this characteristic. Empirical
studies on two genre datasets and one topic-based dataset show that our method
achieved the best performance among all the popular ensemble approaches we
have tried, including boosting and ECOC. However, is pairwise ensemble truly
dominant in general? Answering that question would require much larger-scale
empirical studies, but it is definitely an important issue. Another line of
research involves discovering the limitations of pairwise ensemble, such as its
computational tractability as the category space grows and the potential paucity
of data for training all pairwise classifiers. One solution would be to select
only category pairs with sufficient training data and to smooth the decisions
via the baseline classifier. Empirical validation of these extensions would be a
natural next step.
Abstract. Most multi-agent reinforcement learning algorithms aim to converge
to a Nash equilibrium, but a Nash equilibrium is not necessarily a desirable
result. On the other hand, there are several methods that aim to depart
from unfavorable Nash equilibria, but they are effective only in limited games.
Building on them, the authors proposed in a previous paper [11] an agent that
learns appropriate actions in PD-like and non-PD-like games through
self-evaluations. However, the experiments we conducted there were static ones
in which there was only one state. Versatility across PD-like and non-PD-like
games is indispensable in dynamic environments, in which several states follow
one after another within a trial. We have therefore conducted new experiments,
in each of which the agents played a game having multiple states. The
experiments include two kinds of games: one notifies the agents of the current
state and the other does not. We report the results in this paper.



1 Introduction 

We investigate the use of reinforcement learning in a multi-agent environment. Many
multi-agent reinforcement learning algorithms have been proposed to date [6, 4, 2, 13,
1]. Almost all of them aim to converge to a combination of actions called a Nash
equilibrium. In game theory, a Nash equilibrium is a combination of actions of rational
players. However, this combination is not optimal in some games such as the prisoner’s
dilemma (PD) [14]. There are, on the other hand, several methods that aim to depart from
undesirable Nash equilibria and proceed to a better combination by handling rewards
in PD-like games [7, 8, 23]. However, since they use fixed handling methods, they are
inferior to normal reinforcement learning in non-PD-like games.

In our previous paper [11], we constructed an agent that learns appropriate
actions in both PD-like and non-PD-like games through self-evaluations. The agent has
two conditions for judging whether the game is like PD or not and two self-evaluation
generators: one generates self-evaluations effective in PD-like games and the other
generates self-evaluations effective in non-PD-like games. The agent selects one of the
two generators according to the judgement and generates a self-evaluation for learning.
We showed results of experiments in several iterated games and concluded that the
proposed method was effective in both PD-like and non-PD-like games.

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 289-300, 2003. 
However, the experiments were static ones in which there was only one state.
Versatility across PD-like and non-PD-like games is indispensable in dynamic
environments, in which several states follow one after another within a trial.
We have therefore conducted new experiments, in each of which the agents played a
game having multiple states. The experiments include two kinds of games: one
notifies the agents of the current state and the other does not. We report the
results in this paper.

This paper consists of six sections. In Section 2, we introduce the generators and the
conditions proposed in the previous paper. In Section 3, we present the new experiments
that we have conducted with two kinds of games played by the agents. In Section 4, we
discuss the results of the experiments and this study as a whole. Related work is
described in Section 5. Finally, we conclude the paper and point out future work in
Section 6.

2 Generating Self-evaluation 

This section introduces our method proposed in the previous paper [11]. 

2.1 Background and Objectives 

In game theory, an actor is called a player. A player maximizes his own payoff and
assumes that all other players do similarly. A player $i$ has a set of actions
$\Sigma_i$, and his strategy $\sigma_i$ is defined by a probability distribution over
$\Sigma_i$. When $\sigma_i$ assigns probability 1 to an action, $\sigma_i$ is called a
pure strategy and we refer to it as the action. $\sigma = (\sigma_i)$ is a vector of
strategies of players in a game, and $\sigma_{-i}$ is a vector of strategies of players
excluding $i$. A payoff of player $i$ is defined as
$f_i(\sigma) \equiv f_i(\sigma_i, \sigma_{-i})$. Then the best response of player $i$
for $\sigma_{-i}$ is a strategy $\sigma_i$ satisfying

$$f_i(\sigma_i, \sigma_{-i}) = \max_{\tau_i} f_i(\tau_i, \sigma_{-i}).$$

The vector $\sigma$ is a Nash equilibrium if, for all $i$, $\sigma_i$ is the best
response for $\sigma_{-i}$. On the other hand, the vector $\sigma$ is Pareto optimal if
there is no vector $\rho$ satisfying

$$\forall i\ f_i(\rho) \ge f_i(\sigma) \quad \text{and} \quad \exists j\ f_j(\rho) > f_j(\sigma).$$

It means that there is no combination of strategies in which someone gets more payoff
than in $\sigma$ without anyone getting less.

In general, a Nash equilibrium need not be Pareto optimal. Table 1 shows a game having
only a single Nash equilibrium, which is not Pareto optimal. This game is an example of
the prisoner’s dilemma (PD) [14]. In this game, since the best response of a player is D
regardless of the other player’s action, there is only one Nash equilibrium
$\sigma = (D, D)$; but it is not Pareto optimal because (C, C) plays the role of $\rho$
in the definition.
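The claim about Table 1 can be checked mechanically. The following minimal sketch enumerates the four pure-strategy combinations of the prisoner's dilemma and tests each against the two definitions above; it is restricted to pure strategies, and the function names are hypothetical.

```python
from itertools import product

ACTIONS = ("C", "D")
# Payoffs from Table 1: PAYOFF[(a, b)] = (payoff of A, payoff of B).
PAYOFF = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
          ("D", "C"): (3, 0), ("D", "D"): (1, 1)}


def is_nash(profile):
    """Each player's action is a best response to the other player's action."""
    a, b = profile
    best_a = all(PAYOFF[(a, b)][0] >= PAYOFF[(x, b)][0] for x in ACTIONS)
    best_b = all(PAYOFF[(a, b)][1] >= PAYOFF[(a, y)][1] for y in ACTIONS)
    return best_a and best_b


def is_pareto_optimal(profile):
    """No other profile is at least as good for both and better for someone."""
    p = PAYOFF[profile]
    for other in PAYOFF.values():
        at_least = all(o >= v for o, v in zip(other, p))
        strictly = any(o > v for o, v in zip(other, p))
        if at_least and strictly:
            return False
    return True


profiles = list(product(ACTIONS, repeat=2))
nash = [p for p in profiles if is_nash(p)]
pareto = [p for p in profiles if is_pareto_optimal(p)]
print("Nash equilibria:", nash)
print("Pareto optimal:", pareto)
```

The only equilibrium found is (D, D), and it is the only combination that is not Pareto optimal, since (C, C) gives both players more.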

In reinforcement learning, an actor is called an agent, and it learns from cycles of
action selection and reward acquisition [18]. Although reinforcement learning was
originally formulated for a single-agent environment, there are many proposals to
extend it to a multi-agent environment [6, 4, 2, 13, 1]. However, almost all of them
aim to converge to a Nash equilibrium without considering Pareto optimality. Hence, in
the PD game of Table 1, they will deliberately converge to the unfavorable result
$\sigma = (D, D)$.
Table 1. Prisoner’s Dilemma [14]. Player A selects a row and B selects a column. Each
takes cooperation C or defection D. (x, y) refers to the payoffs of A and B, respectively

A \ B    C        D
C        (2,2)    (0,3)
D        (3,0)    (1,1)



On the other hand, there are several proposals for making the combination of actions
depart from undesirable Nash equilibria and proceed to a better combination in PD-like
games through reinforcement learning with reward handling methods [7, 8, 23]. However,
since the methods of these proposals are fixed and applied unconditionally, they
are inferior to normal reinforcement learning in non-PD-like games. Hence,
when we equip agents with these methods, we have to check the game in advance.
Nor can they be used in an environment that changes from game to game.

Hence, in the previous paper [11], we constructed an agent that learns appropriate
actions in both PD-like and non-PD-like games. The agent has two reward handling
methods and two conditions for judging the game. We call the handled rewards
self-evaluations and the reward handling methods self-evaluation generators. Each
generator generates self-evaluations appropriate in either PD-like or non-PD-like games,
and the conditions judge whether the game is like PD or not. In each learning cycle, the
agent judges the game through the conditions, selects one of the two generators
according to the judgement, generates a self-evaluation, and learns through the
evaluation.

Before introducing the generators and the conditions in detail, we classify games in
the next subsection.

2.2 Classification of Symmetric Games 

We classify symmetric games into four classes in terms of game theory. A symmetric
game is one in which all players have a common set of actions $\Sigma$ and a common
payoff function $f$. A Nash equilibrium which consists only of pure strategies is
called a pure strategy Nash equilibrium (PSNE), and a PSNE in which all players’ actions
are identical is called a symmetric PSNE (SPSNE). Here, we classify a symmetric game
into one of the following classes by the set of SPSNEs $N$ and the set of Pareto optimal
combinations of actions $P$ of the game.

Independent: $N \neq \emptyset$, $N \subset P$

All SPSNEs are Pareto optimal.

Boggy: $N \neq \emptyset$, $N \cap P = \emptyset$

None of the SPSNEs is Pareto optimal.

Selective: $N \neq \emptyset$, $N \cap P \neq \emptyset$, $N \not\subset P$

There are Pareto optimal SPSNEs and non-Pareto optimal SPSNEs.

Rival: $N = \emptyset$

There is no SPSNE.

For a game in the independent class, the combination of actions is desirable when
all players select rational actions independently. Conversely, the combination is
unfavorable for a game in the boggy class. In such a game, the players actually get
less if all of them select the more profitable action. This is the origin of the name
“boggy”. PD is in this class. In a game in the selective class, the problem is which
SPSNE is desirable. The rival class consists of games having some gainers and some
losers (e.g. a game in which a player has to yield the way to another). In the
two-person two-action case, the independent and the boggy classes are both in
Categories I and IV* of Weibull [21], and the selective and the rival classes are in
Categories II and III, respectively.
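The four classes of this subsection can be computed directly from the sets N and P. The sketch below does so for two-player symmetric games, considering only pure action combinations; the function name and the three example payoff tables other than PD are hypothetical and chosen only to produce one game per class.

```python
from itertools import product


def classify_symmetric_game(actions, f):
    """Classify a two-player symmetric game; f(a, b) is the payoff of playing a against b."""
    def payoffs(profile):
        a, b = profile
        return (f(a, b), f(b, a))

    def dominated(profile):
        p = payoffs(profile)
        return any(
            all(x >= y for x, y in zip(payoffs(q), p))
            and any(x > y for x, y in zip(payoffs(q), p))
            for q in product(actions, repeat=2)
        )

    # N: symmetric pure-strategy Nash equilibria; P: Pareto optimal combinations.
    N = {(a, a) for a in actions if all(f(a, a) >= f(b, a) for b in actions)}
    P = {q for q in product(actions, repeat=2) if not dominated(q)}

    if not N:
        return "rival"
    if N <= P:
        return "independent"
    if not (N & P):
        return "boggy"
    return "selective"


def table(payoff):
    """Turn a dict {(own action, other's action): payoff} into a payoff function."""
    return lambda a, b: payoff[(a, b)]


pd = table({("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1})
print(classify_symmetric_game(("C", "D"), pd))
```

For the prisoner's dilemma of Table 1, N contains only (D, D) and P contains the other three combinations, so the intersection is empty.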



2.3 Generating Self-evaluation 

In this paper, we use Q-learning [20], which is a representative reinforcement learning
method. Q-learning updates the Q-function, representing estimates of future rewards,
from each cycle of action selection and reward acquisition. At time $t$, an agent
recognizes the current state $s_t$, takes an action $a_t$, obtains a reward $r_{t+1}$
and recognizes the next state $s_{t+1}$. Afterwards, Q-learning updates the Q-function
$Q_t$ as follows.

$$Q_t(s, a) = Q_{t-1}(s, a) \quad \text{if } s \neq s_t \text{ or } a \neq a_t,$$

$$Q_t(s_t, a_t) = Q_{t-1}(s_t, a_t) + \alpha \delta_t. \qquad (1)$$

In the update rule, $\delta_t$ is a temporal difference (TD) error:

$$\delta_t = r_{t+1} + \gamma \max_a Q_{t-1}(s_{t+1}, a) - Q_{t-1}(s_t, a_t). \qquad (2)$$

The agent selects an action using the Q-function with a randomizing method, e.g. the 
softmax method [18], that adds some randomness to the selection. 
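As an illustration, the update rule and the softmax selection can be sketched as follows. This is a minimal tabular version under our own assumptions: the Q-function is a dictionary keyed by (state, action) pairs, the function names are hypothetical, and the default values of the learning rate, discount factor and temperature are placeholders.

```python
import math
import random
from collections import defaultdict


def softmax_choice(q_values, temperature=1.0, rng=random):
    """Pick an index with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


def td_error(Q, s, a, r_next, s_next, actions, gamma=0.5):
    """Formula 2: r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)."""
    return r_next + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]


def q_update(Q, s, a, r_next, s_next, actions, alpha=0.5, gamma=0.5):
    """Formula 1: only the entry of the visited pair (s_t, a_t) changes."""
    Q[(s, a)] += alpha * td_error(Q, s, a, r_next, s_next, actions, gamma)


Q = defaultdict(float)  # Q-function initialized to 0 everywhere
actions = ("GO", "WAIT")
q_update(Q, "s0", "GO", 1.0, "s1", actions)
print(Q[("s0", "GO")])
```

A high temperature makes the selection nearly uniform, while a low temperature makes it nearly greedy with respect to the Q-function.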

We now introduce two self-evaluation generators. In an agent $A_i$, at time $t$, a
self-evaluation $r'_{i,t+1}$ is generated by adding a term $\Lambda_{i,t+1}$ to a reward
$r_{i,t+1}$, and is then used to update the Q-function. We omit the subscript $i$
showing “the agent itself ($A_i$)” in the following.

$$r'_{t+1} = r_{t+1} + \Lambda_{t+1}. \qquad (3)$$

We propose two $\Lambda$'s, which we call the neighbors' rewards (NR) and the difference
of rewards (DR). NR is effective for a game in the boggy class and DR is effective for a
game in the independent class.

$$\Lambda_{t+1} = \sum_{A_k \in N_i \setminus A_i} r_{k,t+1} \quad \text{(NR)}, \qquad (4)$$

$$\Lambda_{t+1} = r_{t+1} - r_t \quad \text{(DR)}. \qquad (5)$$

In Formula 4, $N_i \setminus A_i$ is the set of agents obtained by excluding $A_i$ from
the set of $A_i$’s neighbors, $N_i$. NR is effective in a game in which the neighbors’
rewards decrease as the agent selects profitable actions. Since DR emphasizes the
difference between the present reward and the last reward, it makes the agent sensitive
to changes of rewards, and the agent tends to take self-interested actions.
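The two generators are simple enough to state in a few lines. The sketch below is an illustration under our own naming; the rewards of all agents at the current time are assumed to be available in a list indexed by agent number.

```python
def neighbors_rewards(rewards, neighbors, i):
    """NR (Formula 4): sum of the current rewards of A_i's neighbors, excluding A_i."""
    return sum(rewards[k] for k in neighbors if k != i)


def difference_of_rewards(r_now, r_prev):
    """DR (Formula 5): the present reward minus the last reward."""
    return r_now - r_prev


def self_evaluation(r_now, lam):
    """Formula 3: the self-evaluation is the reward plus the generated term."""
    return r_now + lam


# Hypothetical two-agent step: agent 0 gained 1.0 while its neighbor lost 0.5.
rewards = [1.0, -0.5]
lam_nr = neighbors_rewards(rewards, neighbors=[0, 1], i=0)
lam_dr = difference_of_rewards(r_now=1.0, r_prev=-0.5)
print(self_evaluation(rewards[0], lam_nr), self_evaluation(rewards[0], lam_dr))
```

In this example NR lowers the self-evaluation of the profitable action because the neighbor suffered, whereas DR raises it because the reward improved.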

* Categories I and IV are the same except for action names. 
Table 2. Auto-Select (AS-) AA, AR, and RR: Q-functions for the judgement

AS-    Formula 6      Formula 7
AA     $Q^{act}$      $Q^{act}$
AR     $Q^{act}$      $Q^{recog}$
RR     $Q^{recog}$    $Q^{recog}$



Although NR and DR are effective in games in the boggy class and in the independent
class, respectively, they are harmful when their use is reversed because the two
classes are opposite. Therefore, we have to devise a way for the agent to select
appropriately between the two $\Lambda$'s according to the game. In the previous paper,
we proposed two conditions, given by the following formulae, for judging the class of
the game and selecting $\Lambda$.

$$Q_{t-1}(s_t, a) < 0 \quad \text{for all } a, \qquad (6)$$

$$r_{t+1} < Q_{t-1}(s_t, a_t) - \gamma \max_a Q_{t-1}(s_{t+1}, a). \qquad (7)$$

Formula 6 means that there is no hope whatever action the agent currently selects.
Formula 7 is derived from the TD error (Formula 2). This formula shows that the current
situation is worse than what the agent learned, because it shows that the present reward
is less than the estimate calculated from the learned Q-function on the assumption that
the TD error is 0, i.e. that the learning is completed. We can think that the agents
should refrain from taking self-interested actions that bring about such a worse
situation. Thus, we make a rule that NR is used as the self-evaluation generator if at
least one of these formulae is satisfied; DR is used otherwise. We call this rule
auto-select (AS).

In Formula 7, the left-hand side is the present reward ($r$). This requires some
care because, in this paper, the Q-function is learned by self-evaluations ($r'$), not
by rewards. Accordingly, we also introduce a normal Q-function, learned by rewards, for
the judgement. We call this normal Q-function $Q^{recog}$ because it is used for
recognizing games, and refer to the Q-function learned by self-evaluations as
$Q^{act}$ because it is used for action selection. Since this issue does not concern
Formula 6, we are able to introduce two types of agent: one uses $Q^{act}$ and the other
uses $Q^{recog}$ in Formula 6. We call them AS-AR and AS-RR, respectively. We also
introduce AS-AA, which uses $Q^{act}$ in both formulae [10]; it can be regarded as
substituting rewards for self-evaluations in Formula 7. Table 2 shows the relation
among the AS variants.
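The auto-select rule amounts to two inequality tests. The following sketch is an illustration under our own naming: `q6` and `q7` stand for whichever Q-function Table 2 prescribes for Formula 6 and Formula 7 in the chosen AS variant, and both are plain dictionaries keyed by (state, action).

```python
def condition_6(q6, s, actions):
    """Formula 6: every action in the current state has a negative estimate."""
    return all(q6[(s, a)] < 0 for a in actions)


def condition_7(q7, s, a, r_next, s_next, actions, gamma=0.5):
    """Formula 7: the reward is below what a converged Q-function would predict."""
    return r_next < q7[(s, a)] - gamma * max(q7[(s_next, b)] for b in actions)


def auto_select(q6, q7, s, a, r_next, s_next, actions, gamma=0.5):
    """Return "NR" if at least one condition holds, and "DR" otherwise."""
    if condition_6(q6, s, actions) or condition_7(q7, s, a, r_next, s_next, actions, gamma):
        return "NR"
    return "DR"


actions = ("GO", "WAIT")
zero_q = {(s, a): 0.0 for s in ("s0", "s1") for a in actions}
# AS-AA would pass the same Q^act twice; here one all-zero table plays both roles.
print(auto_select(zero_q, zero_q, "s0", "GO", -0.5, "s1", actions))
print(auto_select(zero_q, zero_q, "s0", "GO", 1.0, "s1", actions))
```

With an all-zero Q-function, a negative reward satisfies Formula 7 and selects NR, while a positive reward satisfies neither formula and selects DR.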

In summary, the proposed agents AS-AA, AS-AR, and AS-RR learn actions in the
following cycle:

1. Sense the current state $s_t$.

2. Select and take an action $a_t$ by $Q^{act}$ with a randomizing method.

3. Obtain a reward $r_{t+1}$ and sense the next state $s_{t+1}$.

4. Recognize the game by Formulae 6 and 7, then select $\Lambda_{t+1}$.

5. Generate a self-evaluation $r'_{t+1}$ from $r_{t+1}$ and $\Lambda_{t+1}$ by Formula 3.

6. Update $Q^{act}$ by Q-learning with $s_t$, $a_t$, $r'_{t+1}$, $s_{t+1}$ and
   $Q^{recog}$ by Q-learning with $s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$.

7. Back to 1.
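The learning cycle above can be sketched as a small class. This is an illustrative reading of the steps, not the implementation used in the experiments: the class and method names are hypothetical, the rewards of the other neighbors are assumed to be passed in by the environment, and the last reward is assumed to be 0 before the first step.

```python
import math
import random
from collections import defaultdict


class AutoSelectAgent:
    """Sketch of one AS agent; variant is "AA", "AR" or "RR" as in Table 2."""

    def __init__(self, actions, variant="AR", alpha=0.5, gamma=0.5,
                 temperature=1.0, seed=None):
        self.actions = list(actions)
        self.alpha, self.gamma, self.temperature = alpha, gamma, temperature
        self.q_act = defaultdict(float)    # learned from self-evaluations r'
        self.q_recog = defaultdict(float)  # learned from plain rewards r
        self.q6 = self.q_recog if variant == "RR" else self.q_act
        self.q7 = self.q_act if variant == "AA" else self.q_recog
        self.prev_reward = 0.0             # assumption: last reward is 0 at the start
        self.rng = random.Random(seed)

    def act(self, state):
        """Step 2: softmax action selection on Q^act."""
        qs = [self.q_act[(state, a)] for a in self.actions]
        m = max(qs)
        weights = [math.exp((q - m) / self.temperature) for q in qs]
        return self.rng.choices(self.actions, weights=weights, k=1)[0]

    def _future(self, q, s_next):
        return self.gamma * max(q[(s_next, b)] for b in self.actions)

    def learn(self, s, a, r, s_next, neighbor_rewards):
        """Steps 4 to 6; returns the name of the generator that was selected."""
        use_nr = (all(self.q6[(s, b)] < 0 for b in self.actions)         # Formula 6
                  or r < self.q7[(s, a)] - self._future(self.q7, s_next))  # Formula 7
        lam = sum(neighbor_rewards) if use_nr else r - self.prev_reward  # NR or DR
        r_eval = r + lam                                                 # Formula 3
        for q, signal in ((self.q_act, r_eval), (self.q_recog, r)):
            delta = signal + self._future(q, s_next) - q[(s, a)]         # Formula 2
            q[(s, a)] += self.alpha * delta                              # Formula 1
        self.prev_reward = r
        return "NR" if use_nr else "DR"


agent = AutoSelectAgent(["GO", "WAIT"], variant="AR", seed=0)
first = agent.learn("s0", "GO", 1.0, "s1", neighbor_rewards=[-0.5])
second = agent.learn("s0", "WAIT", -0.5, "s1", neighbor_rewards=[1.0])
print(first, second, agent.act("s0"))
```

Keeping the two Q-functions as separate tables makes the three variants differ only in which table each condition reads, mirroring Table 2.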
Fig. 1. Narrow Road Game: Black cars are parked on the road. Two white cars
simultaneously appear at both sides of the road



3 Experiments 

We have conducted three experiments using two kinds of games having multiple states
played by multiple homogeneous agents. One kind is called Markov games or stochastic
games [6, 4], in which each agent is notified of the current state, and the other kind
is state-unknown games, in which each agent is not informed of the state. We used seven
types of agents for comparison: random (RD), normal (NM), the neighbors’ rewards
(NR), the difference of rewards (DR), and the three types of auto-select (AS-AA,
AS-AR, and AS-RR). RD selects actions randomly, and NM learns by normal Q-learning.

Each experiment was conducted twenty-five times. Q-functions were initialized to
0 in each trial. We used the softmax method [18] as the randomizing method for action
selection, and its temperature parameter was set to 1. The learning rate $\alpha$ and
the discount factor $\gamma$ in Formulae 1 and 2 were both set to 0.5.

3.1 Markov Game 

A Markov game [6, 4] has several matrix games, each of which has a label called a state.
In the game, the combination of all players’ actions makes a state transfer to another,
and every player knows which state he is in now.

Here we use a game named the Narrow Road Game. Suppose a road with two parked
cars and two cars which simultaneously appear at both sides of the road (Figure 1).
Each car selects one of two actions: GO or WAIT. Both cars want to pass the narrow point
as soon as possible. However, if both take GO, neither can pass because the point is
too narrow. If both take WAIT, on the other hand, nothing changes.

Each car (agent) receives a reward of 1 when it takes GO and succeeds in passing, and
-0.5 when it fails to pass or takes WAIT. A cycle is finished after both agents have
passed or they have taken four actions. Every agent learns after every reward acquisition
and knows whether the opposing agent remains and how many actions they have taken. After
one agent succeeds in passing, it gets out of the cycle, and the remaining agent takes
an action again. Then if the remaining agent takes GO, it passes unconditionally;
otherwise, it has to take an action again. These rules are summarized in Figure 2. Since
the payoff matrices (i.e. states) change with the number of agents and every agent knows
which state it is in now, this is a Markov game. The set of neighbors, $N_i$, in
Formula 4 includes all agents in the game.

Figure 3 shows the average of the summed reward of the two agents at the 100th cycle
over 25 trials. In each cycle, the rewards each agent obtains are accumulated, and the
two agents’ accumulated rewards are summed. The maximum summed reward is 1.5, obtained
when one agent passes at the first action phase (i.e. the other waits) and the other
passes at the second; the minimum is -4, obtained when both fail to pass in four action
phases.

Fig. 2. State Transition in Narrow Road Game: A circle and an arrow show a state and a
state transition, respectively. A number in a circle shows how many agents are remaining.
Black arrows show transitions in the case in which one of the agents succeeds in passing,
and gray ones show transitions in other cases. Dashed arrows show transitions by one
remaining agent. Numbers with an arrow show the payoff with each transition. With a solid
black arrow, the upper number is the payoff for the going agent and the lower is the
payoff for the waiting agent

Fig. 3. Result of Narrow Road Game: It shows the average of summed reward of two agents
at the 100th cycle in 25 trials (horizontal axis: agent type)
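The extreme values 1.5 and -4 follow from the rules of the game, as the following minimal simulation of one cycle shows. It is a sketch under our own assumptions: policies are fixed functions of the state rather than learning agents, the state is encoded as a pair (number of remaining agents, action phase), and all names are hypothetical.

```python
GO, WAIT = "GO", "WAIT"


def play_cycle(policy_a, policy_b, max_phases=4):
    """Run one cycle and return the summed reward of the two agents."""
    policies = {"A": policy_a, "B": policy_b}
    remaining = ["A", "B"]
    total = {"A": 0.0, "B": 0.0}
    for phase in range(max_phases):
        if not remaining:
            break
        state = (len(remaining), phase)
        choice = {name: policies[name](state) for name in remaining}
        goers = [name for name in remaining if choice[name] == GO]
        # Exactly one GO passes; two simultaneous GOs block each other.
        passed = goers if len(goers) == 1 else []
        for name in remaining:
            total[name] += 1.0 if name in passed else -0.5
        remaining = [name for name in remaining if name not in passed]
    return total["A"] + total["B"]


def always_go(state):
    return GO


def yield_first(state):
    """Wait in the first phase while both agents remain, then go."""
    return WAIT if state == (2, 0) else GO


print(play_cycle(always_go, yield_first))  # one yields once, then both pass
print(play_cycle(always_go, always_go))    # both insist on going every phase
```

In the best case the rewards are 1 for the first agent and -0.5 + 1 for the second, giving 1.5; in the worst case each agent receives -0.5 in all four phases, giving -4.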

From the figure, we can see the following:

- The three AS versions outperform the other methods. However, according to paired
t-tests, there is no significant difference between the result of AS-AR and those of
the other methods. AS-AA and AS-RR are also not significantly different from NR.

- The summed reward of about 1 for AS-AA and AS-RR means that the probability of
taking more than two action phases is less than a half.



3.2 State-Unknown Games 

The previous subsection showed the experimental result of a Markov game. In a Markov
game, every agent is able to know which state it is in now. However, in a real situation,
we often cannot know the current state precisely and have to guess it.
Therefore, in this subsection, we show the results of two experiments in state-unknown
games.

We used games of the tragedy of the commons (TC) type. TC was introduced by Hardin
[3]: there is an open pasture with several rational herdsmen, each of whom
grazes cattle in the pasture. Since each herdsman tries to maximize his profit, each
one brings as many cattle as possible to the pasture. If all herdsmen bring their cattle
without limit, however, the pasture is overgrazed and ruined. This is the tragedy.

Mikami et al. [7] conducted an experiment consisting of ten agents. In a cycle, each
agent takes one of three actions (selfish, cooperative, and altruistic) simultaneously
and receives a reward. An action $a_i$ of an agent $A_i$ ($i = 0, 1, \ldots, 9$) entails
a base reward $R(a_i)$ and a social cost $C(a_i)$. $R$ is 3, 1, and -3 and $C$ is +1, +0,
and -1 when the agent takes a selfish, a cooperative, and an altruistic action,
respectively. After all agents take actions, $A_i$ receives a reward $r_i$ defined as

$$r_i = R(a_i) - \sum_j C(a_j).$$

The set of $A_i$’s neighbors, $N_i$, in Formula 4 is defined as

$$N_i = \{A_k \mid k = (i + j) \bmod 10,\ j = 0, 1, 2, 3\},$$

and $A_i$ uses the combination of actions of $N_i$ as a state in Q-learning. $A_i$ is
able to know only this combination, not all agents’ actions.

In our previous paper [11], we modified the cost $C$ to be +c, +0, and -c when the agent
took a selfish, a cooperative, and an altruistic action, respectively, where c was a
constant common to all agents; the game was identical to Mikami’s when c = 1. We
showed the results of two experiments, c = 1 and c = 0, which were in the boggy class
and in the independent class, respectively, and concluded that the proposed methods
(ASs) were effective in both games.
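The reward structure of the TC game, including the cost parameter c, can be sketched as follows. The sketch assumes that the sum in the reward definition runs over all ten agents; the function and constant names are hypothetical.

```python
BASE_REWARD = {"selfish": 3, "cooperative": 1, "altruistic": -3}
COST_SIGN = {"selfish": +1, "cooperative": 0, "altruistic": -1}


def tc_rewards(actions, c=1):
    """Reward of each agent: its base reward minus the total social cost."""
    total_cost = sum(c * COST_SIGN[a] for a in actions)
    return [BASE_REWARD[a] - total_cost for a in actions]


def neighbors(i, n_agents=10, span=4):
    """Indices of N_i: agents (i + j) mod n_agents for j = 0, ..., span - 1."""
    return [(i + j) % n_agents for j in range(span)]


print(tc_rewards(["selfish"] * 10, c=1))     # everyone selfish, boggy setting
print(tc_rewards(["altruistic"] * 10, c=1))  # everyone altruistic, boggy setting
print(tc_rewards(["selfish"] * 10, c=0))     # everyone selfish, independent setting
print(neighbors(8))
```

With c = 1 a selfish action is individually better by itself (3 - 1 against 1 or -3 + 1), yet everyone choosing it yields -7 each, which is the boggy situation; with c = 0 the selfish action is both individually and collectively best.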

In this paper, the cost parameter c is changed within each trial without informing the
agents. Since the results in the previous paper were taken at the 3000th cycle, we
changed the parameter c in a trial after 3000 cycles and continued the trial for a
further 3000 cycles. We have conducted two experiments: the parameter c is changed from
1 (boggy) to 0 (independent) and vice versa. We show the results at the 6000th cycle
(i.e. the 3000th cycle after the change) together with the previous results for
comparison. We also compare the results of the experiments with those of Mikami’s
average filter (AF)^ [7] and of our past method called the general filter (GF) [9].

Figure 4 shows the average of the summed reward of the ten agents at the 6000th cycle
over 25 trials. From the figure we can see that, in the experiment from c = 1 (boggy)
to 0 (independent), the results are considerably different from the previous static ones,
especially those of the ASs. On the other hand, the results are similar in the experiment
from c = 0 to 1.



^ They proposed two types of AF; but due to space constraint, only the result of type 1 is shown 
here. 
Fig. 4. Result of the Tragedy of the Commons, (a) c = 1 → 0 and (b) c = 0 → 1
(horizontal axis: agent type RD, NM, NR, DR, AS-AA, AS-AR, AS-RR, AF, GF): Each thick,
black bar shows the average of summed reward of ten agents at the 6000th cycle (i.e. the
3000th cycle after change) in 25 trials. Each thin, gray bar shows the result at the
3000th cycle in 25 trials of a previous static experiment with changed c, i.e. 0 in (a)
and 1 in (b) [11]



4 Discussion 



In the Narrow Road Game, the ASs are the best of all. This may seem slightly strange,
because the ASs only choose between NR and DR, each of which performed worse than the
ASs on its own; but it makes sense if we interpret the game as requiring agents to use NR
and DR appropriately according to the state. In fact, when two agents are
remaining, which is in the rival class, one agent has to give way to the other, and NR
may be effective in this class. On the other hand, when only one is remaining, which is
in the independent class, it has to go decisively, and DR is effective in this class.

In the state-unknown games, although the result of ‘0 to 1’ is not affected by the
change, the result of ‘1 to 0’ is worse than that of the static c = 0 experiment. This is
because the policy an agent learned before the change had a deleterious effect on its
actions after the change. Since c = 1 is in the boggy class, the agent mainly learned
through NR before the change. Hence, the agent’s $Q^{act}$ value for altruistic actions
must have been larger than that for selfish actions, and the value for cooperative
actions may have been larger as well. On the other hand, not only selfish actions but
also cooperative ones bring positive rewards after the change in this game. Therefore,
even if altruistic actions were appropriately abandoned through learning after the
change, the action the agent chose may have been not the selfish one, which is desirable
for c = 0, but the cooperative one. Consequently, if the agent learns through the same
Q-functions before and after the change, we cannot avoid this problem even if we humans
choose NR and DR properly. Thus, in order to learn appropriate actions in these games,
what we have to improve is probably not the conditions for selecting self-evaluation
generators, but the generators themselves, i.e. NR and DR.

We have constructed an agent which is able to learn suitable actions in several
games by means of Q-learning with self-evaluations instead of rewards. As far as we
know, no such work yet exists in a reinforcement learning context. There are indeed
some works which handle reward functions for learning suitable actions [7, 8, 23], but
they differ from this work because each of them works for only one game. If we use
their works for constructing agents, we have to decide in advance whether we are to use
them according to the game. If we use this work, on the other hand, it will surely reduce
the problem of judging the game.

This work aims to discover suitable conditions or meta-rules for learning. In this 
paper, we introduce two conditions for judging the boggy game. Although we can in- 
troduce other meta-rules, it will be difficult. Thus, we should devise methods for con- 
structing meta-rules themselves. It becomes a meta-learning problem [19] in a learning 
context. We can also search function space of self-evaluations by genetic algorithm. 

5 Related Works 

Although there are only a few works which handle rewards for learning suitable actions 
in a reinforcement learning context, there are several methods in a genetic algorithm 
(GA) context. Mundhe et al. [12] proposed several fitness functions for the tragedy of 
the commons. Their proposed fitness functions are used if a received reward is less than 
the best of the past; otherwise, normal functions are used. Thus, unlike Mikami’s [7] 
and Wolpert’s [23], they are conditionally used but only applied to games in the boggy 
class. 

There are a few works aiming to obtain meta-rules. Schmidhuber et al. [17] pro- 
posed a reinforcement learning algorithm called the success story algorithm (SSA). It 
learns when it evaluates actions in addition to normal actions. Zhao et al. [24] used the 
algorithm in a game in which agents have to cooperate to gain food, as they escape 
from a pacman. In a GA context, Ishida et al. [5] proposed agents each of which had 
its own fitness function called a norm. If the agent’s accumulated reward falls below
a threshold, the agent dies and a new one having a new norm takes the agent’s place.
They conducted only an experiment in the tragedy of the commons. 

Uninformed changes in state-unknown games are similar to concept drifts in a sym- 
bolic learning context, which are changes of concepts driven by some hidden contexts. 
A learner in the environment having concept drifts usually uses a kind of windowing 
method to cast off expired inputs. Widmer et al. [22] presented a flexible windowing 
method which modified the window size according to the accuracy of learned concepts
to the input sequence in order to follow the concept drifts more flexibly. They also pro- 
posed a learner that reuses some old concepts learned before. Sakaguchi et al. [16] pro- 
posed a reinforcement learning method having the forgetting and reusing property. In 
their method, if the TD error is over a threshold, a learner shifts the current Q-function
to an adequate one it already has, or creates a new Q-function if it does not have one. They
conducted experiments only in single agent environments.

In game theory, each player takes defection in the prisoner’s dilemma (PD). How- 
ever, when humans play the game, the result is different from that of theories. Rilling 
et al. [15] requested thirty-six women to play PD and watched their brains by func- 
tional Magnetic Resonance Imaging (fMRI). They reported that several parts of reward 
processing in their brains were activated when both players cooperated, and explained 
that rewards for cooperation were then generated in their brains. This shows that there 
are common features in a human’s brain processes and in the proposed method of this 
work. 

6 Conclusion

Most multi-agent reinforcement learning methods aim to converge to a Nash equilibrium,
but a Nash equilibrium does not necessarily mean a desirable result. On the
other hand, there are several methods aiming to depart from unfavorable Nash equilib- 
ria, but they are effective only in limited games. Hence, we have constructed an agent 
learning appropriate actions in many games through reinforcement learning with self- 
evaluations. 

First we defined a symmetric pure strategy Nash equilibrium (SPSNE) and catego- 
rized symmetric games into four classes — independent, boggy, selective, and rival — 
by SPSNE and Pareto optimality. Then we introduced two self-evaluation generators 
and two conditions for judging games in the boggy class. We proposed three types of 
methods, which were auto-select (AS-) AA, AR, and RR. 

We have conducted experiments in Narrow Road Game and state-unknown games. 
The result of Narrow Road Game shows that the versatility for PD-like and non-PD- 
like games is indispensable in games having multiple states. In ‘0 to 1’ state-unknown 
game, there is no effect from the change, thus the proposed methods (ASs) seem robust 
to multiple states. On the other hand, the result of ‘ 1 to 0’ state-unknown game shows 
that there exist games in which ASs are fragile. However, discussion about this ineffec- 
tiveness in Section 4 points out that the fragility is derived from not the conditions for 
selecting self-evaluation generators, but the selected generators. 

We are able to point out several future works. First, since we have conducted only 
empirical evaluation in this paper, we have to evaluate the proposed methods theoret- 
ically. Next, since we have classified only symmetric games, we need to extend the 
class definition so that it can deal with asymmetric games. Furthermore, we conducted 
experiments only in simple games with homogeneous agents, and thus, we have to con- 
duct experiments in more complex problems which consist of heterogeneous agents and 
evaluate the proposed methods by the results. 
Abstract. Recently, ranking and sorting problems have attracted the attention
of researchers in the machine learning community. By ranking, we refer to cate- 
gorizing examples into one of K categories. On the other hand, sorting refers to 
coming up with the ordering of the data that agrees with some ground truth prefer- 
ence function. As against standard approaches of treating ranking as a multiclass 
classification problem, in this paper we argue that ranking/sorting problems can 
be solved by exploiting the inherent structure present in data. We present effi- 
cient formulations that enable the use of standard binary classification algorithms 
to solve these problems, however the structure is still captured in our formula- 
tions. We further show that our approach subsumes the various approaches that 
were developed in the past. We evaluate our algorithm on both synthetic datasets 
and for a real world image processing problem. The results obtained demonstrate 
the superiority of our algorithm over multiclass classification and other similar 
approaches for ranking/sorting data. 

1 Introduction 

Consider the problem of ranking movies. The goal is to predict whether a movie fan will
like a certain movie or not. One can probably extract a number of features, such as the actors
playing the lead role, the setting of the movie (urban or rural), and the type of movie (romantic,
horror, etc.). Based on this, a set of possible ratings for a movie could be: run-to-see,
very-good, good, only-if-you-must, do-not-bother, and avoid-at-all-cost. At another
time, given a list of movies, one might like to obtain an ordered list where the order in the
list represents the fan's preferences or choices. Here we observe that, in both these problems,
the labels themselves do not carry much meaning. The information is captured by the
relative ranks or order given to the movies. Henceforth, we refer to the first problem as
the ranking problem and the second problem as the sorting problem.

Most of the research in the machine learning community has gone into developing 
algorithms which are good for either classification or regression. Given that, some of 
the researchers have attempted to solve the ranking and sorting problems by posing 
them as either multiclass classification or regression problems. In this paper, we argue 
that this is not the right approach. If the ranking problem is posed as a classification 
problem then the inherent structure present in ranked data is not made use of and hence 
generalization ability of such classifiers is severely limited. On the other hand, posing 
the task of sorting as a regression problem leads to a highly constrained problem. 

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 301-312, 2003. 
This paper starts with the theoretical analysis relating the problem of sorting with
classification. We show that the VC-dimension of a linear classifier is directly related to 
the rank-dimension (defined later) of a linear classifier. We use the intuition developed 
by studying the complexity of sorting/ranking problems to show how these can be re- 
duced to a standard classification problem. This reduction enables use of both powerful 
batch learning algorithms like SVM and online classification algorithms like Winnow 
and Perceptron. The online version of our algorithm subsumes the previously proposed 
algorithm Pranking [2] and at the same time is much easier to implement. Further, in 
many ranking problems, because of the user's preferences, only a few features may be active
at any given time, and as such Winnow [9] is a better choice than Perceptron. We
further extend the results presented in the paper to handle the case when it may not be 
possible to learn a linear ranker in the original data space. In particular, we make use 
of kernel methods and show how one can learn a non linear ranker for both the ranking 
and sorting problems. This formulation is similar in spirit to the work by [4], however 
in our approach there is a significant reduction in the computational complexity. 

At this point, we would like to note that the basic problem formulation presented 
in this paper is different from the formulation that has been assumed in algorithms 
like Cranking [7]. The model adopted in the latter is specifically meant to learn the 
sorting of the data based on the ordering of the data given by other experts. The problem 
considered in this paper however deals with the case when we actually do not have the 
orderings from individual experts but instead have some feature based representation of 
the data. For example, given some documents/images that need to be sorted or ranked, 
we will extract the features from these documents/images and learn a ranker on top of 
these. This is the view which has also been adopted in [2, 4]. 

Recently, there has also been work to solve the problem of multiclass classification 
by considering ordering information in class attributes, thereby posing it as a ranking 
problem. In [3], Frank et al. transform the data in such a way that the $k$-class ordinal
classification problem reduces to $k - 1$ binary class problems. Krammer et al. [13] use
a modification of S-CART (Structural Classification And Regression Trees) to perform
ordinal classification. The algorithm that we propose in this paper can be easily adapted
to solve the problem of multiclass classification if the ordering of the classes is known.

The organization of the paper is as follows. We start off by explaining some no- 
tations and give the formal definition of the problem. We will also explain the model 
adopted to solve the problem. In Sec. 3, we give the relationship between the rank- 
dimension and VC-dimension of a linear classifier. Sec. 4 gives the main result of the 
paper. In this section, we show how one can reduce the ranking and sorting problems to 
a classification problem. Based on the results from this section, we show how one can 
use SVM and other kernel algorithms in Sec. 5. Finally, in Sec. 6 we give results on both 
synthetic data and test our ranking algorithm on a novel image processing problem. 



2 Notations and Problem Definition 

Consider a training sample of size $m$, say $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$,
$x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, where $\mathcal{X}$ is the domain from which each training
example comes and $\mathcal{Y}$ is the space from which labels are assigned to each example. The
labels $y_i$ are referred to as the ranks given to the examples. As mentioned earlier, in this
paper we have adopted a feature based representation of the data and we assume that
$\mathcal{X}$ is the $n$-dimensional space of reals $\mathbb{R}^n$. Under this, for any
$x_i, x_j \in \mathcal{X}$ we have $x_i - x_j \in \mathcal{X}$.

For the ranking problem, $\mathcal{Y} = \{1, \ldots, K\}$ where $K$ is the maximum rank that can be
taken by any example. This is similar to the multiclass classification problem. However,
the spirit of the ranking problem is very different. The ranks relate to the degree of
interest/confidence a person has in that instance. Given an example with rank $k$, all
the examples with rank less than $k$ are less interesting and all the examples with rank
more than $k$ are more interesting. Such a relationship/viewpoint does not exist in the case
of the multi-class classification problem. In general we will assume that $K$ (the maximum
rank) is fixed for a given problem and will treat the case when $K$ is the same as the size of
the dataset separately. We refer to this latter case as the sorting problem. In the sorting
problem, each training example has a distinct rank (this rank will be referred to as the
order in the dataset) which refers to the relative position of that instance in the dataset.
In this case, one cannot treat it as a multiclass problem, as the number of classes depends
on the size of the data and the interest is in learning a function that can give relative
ordering instead of absolute class labels.

2.1 The Ranking Model 

In this paper, we adopt a functional approach to solve the ranking problem. Given a
data point $x$ or a set of data points $S$, we learn a ranker $f : \mathcal{X} \to \mathcal{Y}$. Depending upon
whether $K$ is fixed or it is $m$, our choice of function $f$ will be different. In general, when
$K < m$ is fixed, we will refer to it as the ranking problem, whereas when $K = m$, we
will refer to it as the sorting problem.

In this paper, we assume that there exists an axis in some space such that when
data is projected on to this axis, the relative position of the data points captures the
model of user preferences. In the case of the sorting problem, we will treat $f$ to be
a linear function whose value is the signed distance from some hyperplane $h$. That
is, $f(x_i) = h^T x_i$. The information about the relative order of the datapoints will be
captured by the distance from the hyperplane. During the testing phase, given a set of
datapoints, the sorted set of datapoints with respect to their order will be obtained by
sorting them with respect to the raw value $f(x)$.

We will adopt a similar framework to solve the ranking problem. However, in this
case, in addition to learning $h$ (as in the previous case), we also learn a number of
thresholds ($K-1$) corresponding to the different ranks that will be assigned to data. The
learned classifier in this case will be expressed as $(h, \theta_1, \ldots, \theta_{K-1})$ with the thresholds
satisfying $\theta_1 < \theta_2 < \ldots < \theta_{K-1}$. The ranking rule in this case is

$$f(x) = \begin{cases} 1 & \text{if } h^T x < \theta_1, \\ i & \text{if } \theta_{i-1} < h^T x < \theta_i, \quad 1 < i < K, \\ K & \text{if } \theta_{K-1} < h^T x. \end{cases} \qquad (1)$$

Although this model of ranking may seem too simplistic, as we show in Sec. 3
it is quite powerful, and we give some analysis relating the VC-dimension of the learned
classifier to what we call the rank-dimension of the data. Later in Sec. 5, we will show how
one can extend the above framework to the case where learning needs to be done in a
space different from the original space. In such a case, learning is done in some high
dimensional space by making use of kernels for the mapping.
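
As a minimal sketch of the threshold-based ranking rule described in this section, the following Python fragment assigns a rank by counting how many thresholds lie strictly below the score $h^T x$. The function name `predict_rank`, the example ranker, and the treatment of ties (a score equal to a threshold falls into the lower rank) are assumptions made for illustration and are not taken from the paper.

```python
from bisect import bisect_left

def predict_rank(h, thresholds, x):
    """Return a rank in {1, ..., K} for x, given a linear ranker h and
    K-1 thresholds sorted in increasing order (illustrative helper)."""
    score = sum(hi * xi for hi, xi in zip(h, x))  # h^T x
    # bisect_left counts the thresholds strictly below the score
    return bisect_left(thresholds, score) + 1

# Hypothetical ranker on 2-d data with K = 4 (three thresholds)
h = [1.0, 0.0]
thresholds = [-1.0, 0.0, 2.0]
ranks = [predict_rank(h, thresholds, [s, 5.0]) for s in (-3.0, -0.5, 1.0, 7.0)]
```

Sorting, by contrast, would need only the raw scores $h^T x$ and no thresholds.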

3 Complexity of Ranking vs Classification 

It has been argued [1] that the ranking problem is much harder than the classification 
problem. Although this is true in the particular view adopted by [1], in this paper we 
present an alternate viewpoint. We analyze the complexity of the ranking problem from 
the view of the VC-dimension. We define the variant of the VC-dimension, called rank- 
dimension, for the ranking problem as follows; if the data points are ranked with respect 
to the value of the functional evaluated on a particular data point, then we say that the 
rank dimension of the functional is the maximum number of points that can be ranked 
in any arbitrary way using this functional. 

Theorem 1. The rank dimension of a linear functional is the same as its VC-dimension.
Following the notation given in Sec. 2, it holds that for all $x_i, x_j \in \mathcal{X}$ we also have
$x_i - x_j \in \mathcal{X}$ and $x_j - x_i \in \mathcal{X}$.

Proof. Let us consider the case of a linear classifier $h \in \mathbb{R}^n$. Say one observes a set of $m$
points $S = \{x_1, x_2, \ldots, x_m\}$, with the corresponding ranks $y_1, y_2, \ldots, y_m$.

Clearly, if we can rank a set of $m$ points in any arbitrary way using a functional, then
we can always shatter them (at the cost of one additional dimension corresponding to
the threshold). Consider a subset $S_0 \subset S$ such that we want to label all the points that
belong to $S_0$ as negative and all the points that belong to $S$ but not to $S_0$ as positive (i.e.
$S \setminus S_0$). Now, if we rank all the points in such a way that the rank of all the points in
$S_0$ is less than the rank of all the points in $S \setminus S_0$, then we can do the classification by just
thresholding based on the rank. This shows that the rank-dimension of any functional
cannot be more than the “VC-dimension” of the same functional.

We know that the VC-dimension of a linear classifier in $n$-dimensional space is
$n+1$. That is, any set of $n+1$ points (assuming general position) in $n$-dimensional
space can be shattered by an $n$-dimensional linear classifier. Now we show that any
set of $n+1$ points can be ranked in any arbitrary way using a linear classifier in
$n$-dimensional space. Given any arbitrary ranking of the points, let us re-label the points
such that $\mathrm{rank}(x_1) < \mathrm{rank}(x_2) < \ldots < \mathrm{rank}(x_{n+1})$. Define a new set of points
$S' = \{0, x_2 - x_1, x_3 - x_2, \ldots, x_{n+1} - x_n\}$ and label these points as $\{-1, 1, 1, \ldots, 1\}$.
The cardinality of the set $S'$ is $n+1$ ($n$ difference vectors and one $0$ vector). Also, it is easy
to see that all points in $S'$ lie in $\mathbb{R}^n$. Now, from VC-dimension theory, we know that
there exists a linear classifier in $n$-dimensional space that can shatter $S'$ according to the
labelling given above. Let this linear classifier be $h$, with classification as $\mathrm{sign}(h^T x)$.
Then for correct classification $h^T(x_i - x_{i-1}) > 0 \Rightarrow h^T x_i > h^T x_{i-1}$. That is, the
distance of the original points from the hyperplane does correspond to the specified
ranking. Hence, we have shown that any set of $n+1$ points can be ranked in any
arbitrary fashion by an $n$-dimensional classifier, and at the same time we have also shown
that the rank-dimension cannot be more than the VC-dimension. This shows that the
rank-dimension of any classifier is the same as its VC-dimension.
This is a very interesting result, as it shows that the complexity of the hypothesis
space for the two problems is the same. However, as of now, we are not clear about the
relation between the growth function for the two problems. Further, the relation between 
the computational complexity of the two problems has to be studied. 

4 Ranking, Sorting and Classification 

In the previous section, we saw the close relationship between the classification and 
the sorting problem. It is clear that if we can solve a sorting problem then we can also 
solve a ranking problem (by giving any arbitrary order to all the data points within the 
same rank class). However, it turns out that computationally both problems demand 
individual treatment. In this section, we present two approaches to solve this problem. 
The first is referred to as the difference space approach while the second is referred to 
as the embedded space approach. 

4.1 Difference Space Approach 

Given a training set $S$, define a new set $S_d$ of difference vectors $x^d_{ij} = x_i - x_j;\ \forall i,j :
y_i \neq y_j$ and their corresponding labels $y^d_{ij} = \mathrm{sign}(y_i - y_j)$. This leads to a dataset*
of size $O(m^2)$. Learning a linear classifier for this problem would be the same as learning
a ranker $h$. Once such a ranker is learned, the thresholds for the ranking problem can
easily be computed. This formulation is the same as the one proposed by [4]. The computational
complexity of most learning algorithms depends on the size of the training data,
and a quadratic increase in the size of the data will certainly make most of the existing
algorithms impractical. Hence, we propose to generate difference vectors only among
the adjacent rank classes. Formally, given a training set $S$, obtain a new set $S_d$ made
up of difference vectors $x^d_{ij} = x_i - x_j;\ \forall i,j : y_i = y_j + 1$ and their corresponding
labels $y^d_{ij} = +1$. This would result in a dataset with only positive examples. Again,
most standard classification algorithms behave well if the number of positive examples
is close to the number of negative examples. To get around this problem, once such a
dataset is obtained, multiply each example $x^d_{ij}$ and the corresponding label $y^d_{ij}$ by
a random variable taking values in $\{-1, 1\}$ with equal probabilities. Clearly,
learning a linear classifier over this dataset will give a ranker $h$ which will be the same as
the one obtained in the previous case. The size of the dataset in this case is $O(m^2/K)$. For a
small $K$ (which is the case in most $K$-ranking problems) this is still too large to handle.
However, interestingly, for the sorting problem, the size of the new dataset is of the
same order as the size of the old dataset and therefore this problem can be solved easily.
Next, we present an approach that specifically handles the ranking problem without
exploding the size of the training dataset.
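
The adjacent-rank construction with random sign flipping can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `adjacent_difference_set`, the list-of-lists data layout, and the fixed seed are assumptions made for the example.

```python
import random

def adjacent_difference_set(X, y, seed=0):
    """Build difference vectors x_i - x_j for all pairs with y_i = y_j + 1,
    then multiply each vector and its +1 label by a random sign."""
    rng = random.Random(seed)
    diffs, labels = [], []
    for xi, yi in zip(X, y):
        for xj, yj in zip(X, y):
            if yi == yj + 1:              # only adjacent rank classes
                q = rng.choice((-1, 1))   # random sign, equal probability
                diffs.append([q * (a - b) for a, b in zip(xi, xj)])
                labels.append(q)          # original label +1 times q
    return diffs, labels

# Toy 1-d data in which a larger value means a higher rank
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1, 2, 2, 3]
diffs, labels = adjacent_difference_set(X, y)

# In the sorting case every rank is distinct, giving only m - 1 pairs
sort_diffs, sort_labels = adjacent_difference_set(X, [1, 2, 3, 4])
```

Because the vector and its label are flipped together, a ranker $h$ that separates the unflipped set also separates the flipped one.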

* The different elements of this dataset are not independent of one another.

4.2 Embedded Space Approach

In this section, we present a new framework to handle the ranking problem which can
be seen as the main contribution of this paper. It is a novel formulation that allows one
to map a ranking problem to a standard classification problem without increasing the
size of the dataset. The embedded space approach presented in this section is similar in
spirit to the model presented in [12]; however, as we will see shortly, in our model the
dimension of the new space does not grow linearly as the one presented in their paper.
Fig. 1 graphically depicts the ranking framework. The distance from the hyperplane $h$
of a datapoint $x$ is mapped to a one dimensional space. In this space, $\theta_1, \ldots, \theta_{K-1}$ are
the different thresholds against which the distance is compared. Note that $h^T x_j$, $\forall x_j$
having rank $i$, results in a range represented by its left end point $\theta_{i-1}$ and its right end
point $\theta_i$. Define $a_i = \theta_{i+1} - \theta_i;\ 1 \le i < K-1$. For the data items belonging to rank
1, there is no lower bound and for all the data items belonging to rank $K$ there is no
upper bound. By construction, it is easy to see that $a_i > 0;\ \forall i$. Note that a data point $x_j$
having rank $i > 1$ will satisfy (assuming $a_0 = 0$)

$$h^T x_j > \theta_{i-1}; \quad h^T x_j + a_{i-1} > \theta_i; \quad \ldots; \quad h^T x_j + \sum_{k=i-1}^{K-2} a_k > \theta_{K-1} \qquad (2)$$

Similarly, for an example with rank $i < K$ (assuming $\theta_K = \infty$, $a_{K-1} = 0$),

$$h^T x_j < \theta_i; \quad h^T x_j + a_i < \theta_{i+1}; \quad \ldots; \quad h^T x_j + \sum_{k=i}^{K-2} a_k < \theta_{K-1} \qquad (3)$$

Based on this observation, define $\bar{h} = [h\ a_1\ a_2\ \cdots\ a_{K-2}]$ and, for an example $x_j$ with
rank $1 < i < K$, define $x_j^+, x_j^-$ as $n+K-2$ dimensional vectors with

$$x_j^+[l] = \begin{cases} x_j[l] & 1 \le l \le n, \\ 0 & n < l < n+i-1, \\ 1 & n+i-1 \le l \le n+K-2, \end{cases} \qquad x_j^-[l] = \begin{cases} x_j[l] & 1 \le l \le n, \\ 0 & n < l < n+i, \\ 1 & n+i \le l \le n+K-2. \end{cases} \qquad (4)$$

For an example $x_j$ with rank $i = 1$, we define only $x_j^-$ as above, and for an example with
rank $i = K$, we define only $x_j^+$, again as above. This formulation assumes that $\theta_{K-1} = 0$.
It is easy to see that one can assume this without loss of generality (by increasing
the dimension of $x$ by 1 one can get around this). Once we have defined $x_j^+, x_j^-$, the
ranking problem simply reduces to learning a classifier $\bar{h}$ in $n+K-2$ dimensional
space such that $\bar{h}^T x_j^+ > 0$ and $\bar{h}^T x_j^- < 0$. This is a standard classification problem
with at most $2m$ training examples, half of which have label $+1$ (examples $x_j^+$) and the rest
have label $-1$ (examples $x_j^-$). Even though the overall dimension of the data points
and the weight vector $\bar{h}$ is increased by $K-2$, this representation limits the number
of training data points to be $O(m)$. Note that although we have slightly increased the
dimension (by $K-2$), the number of parameters that need to be learned is still the
same (the classifier and the thresholds). Interestingly, any linear classification method
can now be used to solve this problem. It is easy to prove that if there exists a classifier
that learns the above rule with no error on the training data, then all the $a_i$'s are always
positive, which is a requirement for the classification problem to be the same as the ranking
problem. Next we show how one can use kernel classifiers (SVM) to solve the ranking
problem for datasets where there may not exist a linear ranker.

Fig. 1. Graphical representation of the ranking framework.
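
To make the embedding concrete, here is a small sketch that builds the two $n+K-2$ dimensional vectors for one example and checks them against a hand-picked ranker. It follows the reading of Eq. 4 in which the appended entry for $a_k$ is 1 when $k \ge i-1$ (positive copy) or $k \ge i$ (negative copy). The names `embed` and `dot`, and the toy thresholds $\theta = (-3, -1, 0)$, are assumptions chosen for illustration.

```python
def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def embed(x, rank, K):
    """Return (x_plus, x_minus) for an example x of the given rank.
    x_plus is None for rank 1 and x_minus is None for rank K."""
    def tail(first_k):
        # appended entry k (1..K-2) multiplies a_k; it is 1 from first_k onward
        return [1.0 if k >= first_k else 0.0 for k in range(1, K - 1)]
    x_plus = list(x) + tail(rank - 1) if rank > 1 else None
    x_minus = list(x) + tail(rank) if rank < K else None
    return x_plus, x_minus

# Toy setting: K = 4, thresholds (-3, -1, 0), so a_1 = 2, a_2 = 1 and
# the embedded ranker is h_bar = [h, a_1, a_2] with h = [1.0]
h_bar = [1.0, 2.0, 1.0]
x_plus, x_minus = embed([-2.0], rank=2, K=4)   # -3 < -2 < -1, so rank 2
margin_plus = dot(h_bar, x_plus)
margin_minus = dot(h_bar, x_minus)
```

Each training example contributes at most two embedded vectors, which is where the bound of $2m$ classification examples comes from.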

5 Kernel Classifiers - SVM 

In many real world problems it may not be possible to come up with a linear function
that would be powerful enough to learn the ranking of the data. In such a scenario,
standard practice is to make use of kernels which allow nonlinear mapping of data. We
will denote a kernel as $\mathcal{K}(\cdot, \cdot) = \phi^T(\cdot)\phi(\cdot)$, which corresponds to using the non-linear
mapping $\phi(\cdot)$ over the original feature vector.

Solving Sorting Problems. For the sorting problem, we propose to learn a linear classifier
over the difference feature vectors and then use this learned classifier to sort the
data. While using a non-linear mapping $\phi$ with corresponding kernel $\mathcal{K}$, this would imply

$$\text{Learn } h : y^d_{ij}\left(\phi(h)^T \phi(x^d_{ij})\right) > 0 \;\Rightarrow\; y^d_{ij}\left(\phi(h)^T \phi(x_i - x_j)\right) > 0 \qquad (5)$$

However, we note that $\phi(x_i - x_j) \neq \phi(x_i) - \phi(x_j)$ because $\phi(\cdot)$ is a non-linear mapping.
To get around this, we adopt a different strategy (also proposed in [4]). Instead of
solving the classification problem over the difference vector in the original space, we
solve the classification problem over the difference vectors in the projected space.

$$\text{Learn } h : y^d_{ij}\,\phi(h)^T \left(\phi(x_i) - \phi(x_j)\right) > 0 \qquad (6)$$

Interestingly, it can be solved easily by defining a new kernel function (it can be easily
verified that this is a kernel function) as

$$\hat{\mathcal{K}}(x_i - x_j, x_l - x_m) = \mathcal{K}(x_i, x_l) + \mathcal{K}(x_j, x_m) - \mathcal{K}(x_i, x_m) - \mathcal{K}(x_j, x_l) \qquad (7)$$
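
The kernel over difference vectors follows from expanding the inner product $(\phi(x_i) - \phi(x_j))^T(\phi(x_l) - \phi(x_m))$ term by term. The sketch below evaluates it for an arbitrary base kernel; the function names and the RBF width `gamma` are assumptions for illustration. With a linear base kernel the result must equal the ordinary inner product of the two difference vectors, which gives a simple sanity check.

```python
import math

def linear(u, v):
    """Linear base kernel: plain inner product."""
    return sum(a * b for a, b in zip(u, v))

def rbf(u, v, gamma=0.5):
    """Radial basis function base kernel (gamma is an illustrative choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def difference_kernel(base, xi, xj, xl, xm):
    """Kernel between phi(xi) - phi(xj) and phi(xl) - phi(xm),
    computed from four evaluations of the base kernel."""
    return base(xi, xl) + base(xj, xm) - base(xi, xm) - base(xj, xl)

xi, xj, xl, xm = [1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]
via_kernel = difference_kernel(linear, xi, xj, xl, xm)
direct = linear([a - b for a, b in zip(xi, xj)], [a - b for a, b in zip(xl, xm)])
self_similarity = difference_kernel(rbf, xi, xj, xi, xj)  # squared norm in feature space
```

Swapping the two points of one pair flips the sign of the kernel value, mirroring the sign flip of the difference vector itself.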

Solving Ranking Problems. For solving the ranking problem, we have proposed the
mapping given in Eqn. 4. One has to be careful in using kernel classifiers with this mapping.
To see this, note that if $x_j$ has rank $i$, then $\bar{h}^T x_j^+ > 0 \Rightarrow h^T x_j > \theta_{i-1}$ and
$\bar{h}^T x_j^- < 0 \Rightarrow h^T x_j < \theta_i$, but $\mathcal{K}(\bar{h}, x_j^+) = \phi(\bar{h})^T \phi(x_j^+) > 0 \not\Rightarrow \phi(h)^T \phi(x_j) > \theta_{i-1}$.
This is again because of the nonlinearity of the mapping $\phi(\cdot)$. However, one can again get
around this problem by defining a new kernel function. For a kernel function $\mathcal{K}$ and the
corresponding mapping $\phi$, let us define a new kernel function $\bar{\mathcal{K}}$ with the corresponding
mapping $\bar{\phi}(\cdot)$ as

$$\bar{\phi}(\bar{x}) = [\phi(x),\ \bar{x}[n+1 : n+K-2]]; \quad \bar{\phi}(\bar{h}) = [\phi(h),\ \bar{h}[n+1 : n+K-2]] \qquad (8)$$

Note that only the first $n$ dimensions of $\bar{x}$, corresponding to $x$, are projected to a higher
dimensional space. The new kernel function can hence be decomposed into a sum of two
kernel functions, where the first term is obtained by evaluating the kernel over the first $n$
dimensions of the vector and the second term is obtained by evaluating a linear kernel over
the remaining dimensions.

$$\bar{\mathcal{K}}(\bar{x}_i, \bar{x}_j) = \mathcal{K}(x_i, x_j) + \bar{x}_i[n+1 : n+K-2]^T\, \bar{x}_j[n+1 : n+K-2] \qquad (9)$$

However, when using the SVM algorithm with kernels one has to be careful, as when
working in the embedded space, learning algorithms minimize the norm of $\bar{h}$ and not of $h$,
as should have been the case. In the next section, we introduce the problem of ordinal
regression and show how one can get around this problem.
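
The decomposition of the embedded-space kernel into a base kernel on the first $n$ coordinates plus a linear kernel on the appended coordinates can be sketched as below. The helper names and the example vectors are assumptions for illustration; with a linear base kernel the result reduces to the plain inner product of the two embedded vectors.

```python
import math

def linear(u, v):
    """Linear kernel: plain inner product."""
    return sum(a * b for a, b in zip(u, v))

def rbf(u, v, gamma=0.5):
    """Radial basis function kernel (gamma is an illustrative choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def embedded_kernel(base, xbar_i, xbar_j, n):
    """Base kernel on the original n features plus a linear kernel
    on the K-2 appended indicator coordinates."""
    return base(xbar_i[:n], xbar_j[:n]) + linear(xbar_i[n:], xbar_j[n:])

# Two embedded vectors with n = 1 original feature and K - 2 = 2 appended ones
xbar_i = [-2.0, 1.0, 1.0]
xbar_j = [2.0, 0.0, 1.0]
k_lin = embedded_kernel(linear, xbar_i, xbar_j, n=1)
k_rbf = embedded_kernel(rbf, xbar_i, xbar_j, n=1)
```

Only the original features pass through the non-linear map, so the appended coordinates keep their role as additive offsets.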

5.1 Reduction to Ordinal Regression 

In this section, we show how one can actually get around the problem of minimizing
$\|\bar{h}\|$ as against minimizing $\|h\|$. We want to solve the following problem

$$\text{minimize } \tfrac{1}{2}\|h\|^2; \quad \text{subject to } \bar{y}_j\left(\bar{h}^T \bar{x}_j + b\right) > 0 \qquad (10)$$

Based on the analysis in Sec. 4, the inequality in the above formulation for $x_j$ with rank
$y_j$ can be written as,

$$-b - \sum_{l=n+1}^{n+K-2} \bar{h}(l)\, x_j^-(l) = \theta_{y_j} > h^T x_j > -b - \sum_{l=n+1}^{n+K-2} \bar{h}(l)\, x_j^+(l) = \theta_{y_j - 1}; \quad 1 < y_j < K$$

In this analysis, we will assume that with respect to the thresholds $\theta_i$, there is a margin of
at least $\epsilon$ such that for any datapoint $x_j$ with corresponding rank $y_j$, we have

$$\theta_{y_j - 1} + \epsilon < h^T x_j; \quad 1 < y_j \le K$$

Now, the problem given in Eqn. 10 can be reframed as

$$\text{minimize } \tfrac{1}{2}\|h\|^2; \quad \text{subject to } h^T x_j < \theta_{y_j},\ 1 \le y_j < K; \quad h^T x_j > \theta_{y_j - 1} + \epsilon,\ 1 < y_j \le K$$

This leads to the following Lagrange formulation,

^ m — mji m 

■f'p = 2 II ^ 11^ + Y + Y + 

J=1 J=mi+1 j j 

where $m_i$ refers to the number of elements having rank $i$. The ranker $h$ is obtained by
minimizing the above cost function under the positivity constraints for $\gamma^+$ and $\gamma^-$.
The dual formulation $L_D$ of the above problem can be obtained by following the steps as in
[11],

^ n n 

i=l i=i j j 

with constraints, 

rrii rnij^i 

7p = XI V; V*e[2,/T-1] 

p— rrii — i-\-l p—mi-\-l 



( 12 ) 
Table 1. (left) Comparison of sorting algorithms for linear data. (right) Comparison of sorting
algorithms for nonlinear data.

Linear data:

Algorithm     Ave. No. of Transpositions
Perceptron    36.31
Winnow        35.68
linear-SVM    30.92

Nonlinear data:

Algorithm     Ave. No. of Transpositions
Perceptron    62771
Winnow        62840
kernel-SVM    1456.4



We have introduced € [uk-i + and m € [l,ni] for simplicity of 

notation. These are deterministic quantities with value 0. It is interesting to note that
Eqn. 11 has the same form as a regression problem. The values of the $\theta_i$'s are obtained using
the Kuhn-Tucker conditions.



6 Experiments 

In this section, we present experimental results of the various algorithms that have been 
outlined in this paper. We start off by giving results obtained on synthetic data and then 
we show results of the performance of ranking for a novel image processing application. 



6.1 Synthetic Data 

Sorting. First, we compare the performance of Perceptron, Winnow and SVM for the sorting
problem on linear data. Linear data is obtained by generating $N$-dimensional data
$x_i;\ 1 \le i \le m$. Data points are sorted based on the value of the inner product $w^T x_i$, where
$w$ is a randomly generated $N$-dimensional vector. In the following experiments we
choose $N = 5$, $M = 700$ with 200 examples as training data and the rest as test data. The
performance of the three algorithms was analyzed by computing the minimum number
of adjacent transpositions [7] needed to bring the sorting produced by the learned ranker
$\hat{Y}$ to the ground truth $Y$.

In Table 1 (left), we show the averaged number of transpositions normalized by the
length of the sequence for the three methods. It can be observed that linear-SVMs are
slightly better than the online learning rules.
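
The minimum number of adjacent transpositions needed to turn one ordering into another equals the number of inversions between them, so the evaluation measure can be computed with a merge-sort style count. The sketch below is an illustration only; the function name and the use of hashable item identifiers are assumptions, and normalization by the sequence length is left to the caller.

```python
def adjacent_transpositions(predicted_order, true_order):
    """Minimum number of adjacent swaps that turn predicted_order into
    true_order, computed as the inversion count via merge sort."""
    position = {item: k for k, item in enumerate(true_order)}
    seq = [position[item] for item in predicted_order]

    def sort_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, count_left = sort_count(a[:mid])
        right, count_right = sort_count(a[mid:])
        merged, i, j = [], 0, 0
        count = count_left + count_right
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                count += len(left) - i   # right[j] jumps over the rest of left
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, count

    return sort_count(seq)[1]

truth = ["a", "b", "c", "d"]
same = adjacent_transpositions(["a", "b", "c", "d"], truth)
one_swap = adjacent_transpositions(["b", "a", "c", "d"], truth)
reversed_count = adjacent_transpositions(["d", "c", "b", "a"], truth)
```

A fully reversed list of length $m$ needs $m(m-1)/2$ swaps, which is the largest value this measure can take.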

Next we compare the performance of the three algorithms on nonlinear data. The 
data was generated as above with the addition of quadratic non-linearity in the data 
when obtaining the ground truth sorting. Table 1 (right) shows the performance of the three
algorithms. A radial basis function is used as kernel and it is clear from the results 
that kernel-SVMs handle non linear data very well. Fig. 2 shows an instance of sorting 
performed by kernel SVMs for nonlinear data. In this figure, the data is plotted with 
respect to the ground truth sorting. It is clearly evident that the non-linear SVM is able 
to capture the inherent nonlinearity. 

Ranking. In addition to the various ranking algorithms proposed in the paper, one can
treat ranking as a multiclass classification problem. We present a comparison of various
ranking algorithms on synthetic data and study how their behavior varies with the
increase in the number of classes/ranks. In particular, we analyze ranking algorithms namely

Fig. 2. An example of sorting done using kernel-SVM for nonlinear data (axis label: test
label ID). The ideal ranking is a straight line.



Table 2. (left) Comparison of ranking algorithms for linear data. The rows represent the
values of K and the columns correspond to the algorithms A1: SVM using ordinal regression,
A2: multi-class SVM, A3: SVMs using appending, A4: Perceptron, A5: Winnow. (right)
Comparison of K-ranking algorithms for nonlinear data. The rows represent different K
values ranging from 3 to 10 and the columns correspond to the algorithms A1: SVM using
ordinal regression, A2: multi-class SVM, A3: SVMs using appending, A4: Perceptron,
A5: Winnow.





(left: linear data)

        A1      A2      A3      A4      A5
K=3     7.8     15.6    7.2     10.1    9.1
K=4     8.3     34.7    8.3     7.0     6.2
K=5     4.5     52.2    4.5     9.0     5.6
K=6     5.1     68.2    5.9     10.6    6.0
K=7     9.8     72.5    9.8     8.3     8.5
K=8     7.7     82.0    7.7     8.9     8.1
K=9     8.5     83.6    8.2     11.1    9.1
K=10    8.0     93.3    7.8     9.6     6.2





(right: nonlinear data)

        A1      A2      A3      A4      A5
K=3     22.9    51.9    24.3    78.6    83.2
K=4     29.3    65.5    29.0    95.1    97.0
K=5     35.1    87.4    34.8    105.5   105.5
K=6     37.6    93.6    37.7    113.6   108.2
K=7     36.1    104.3   35.6    114.4   116.1
K=8     30.9    107.5   31.1    116.0   111.8
K=9     40.7    108.6   41.1    116.8   117.1
K=10    38.3    118.0   38.2    123.2   130.3



Perceptron, Winnow, SVM using ordinal regression, multi-class SVM and SVMs using
appending (by mapping the data to a new embedded space as explained in Sec. 4).
Multi-class SVM learns K classifiers, where each classifier distinguishes the class
corresponding to a particular rank from the rest of the classes. SVM using ordinal
regression and SVM using appending are implementations of the theory discussed in
Sec. 5.1 and Sec. 4 respectively. Linear data is generated and sorted as before. The
sorted set is then divided into K equally sized groups, which are assigned labels in the
range [1, K] such that any item in group i must be placed lower than any element in
group i - 1 with respect to some linear hyperplane (unknown to the algorithm). The
results are tabulated in Table 2 (left).
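
The labeling step described above can be sketched as follows. The function name and the choice to give label 1 to the highest-scoring group are assumptions made for illustration; the text only requires that group i is placed lower than group i - 1.

```python
def assign_rank_labels(scores, k):
    """Sort items by score and split them into k equally sized groups.

    Returns a list of labels in 1..k, aligned with `scores`. Label 1 is given
    to the highest-scoring group (an assumption), so that every item in group i
    is placed lower than every item in group i - 1.
    """
    n = len(scores)
    if k <= 0 or n % k != 0:
        raise ValueError("the number of items must be a positive multiple of k")
    group_size = n // k
    # Indices of the items, from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = [0] * n
    for pos, idx in enumerate(order):
        labels[idx] = pos // group_size + 1
    return labels
```

With scores given by the inner product used for sorting, the resulting labels form a K-class ranking problem whose ground truth is consistent with a single linear ranker.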

Next, we give the results for the case when there is no linear ranker. Nonlinear data
is again generated as discussed earlier. Table 2 (right) shows the results obtained.
It is clear from both parts of Table 2 that SVMs using ordinal regression and appending
do significantly better than the multi-class SVMs, because the former methods capture
the inherent structure in the data.

6.2 Automatic Image Focusing

A very nice application of ranking is to rank images capturing a particular
scene based on their level of focus, and hence perform automatic focusing by picking the
image which has the highest rank among all the images. We choose this particular
application [6] [10] [5] to demonstrate the applicability of the ranking formulation
presented in the paper towards solving the task of automatically extracting the best
focused image.

To study the image focusing problem, we started by experimenting with synthetic
data. 50 images were taken from the Corel database. As has been argued in the literature,
Gaussian blurring is closely related to the noise in a badly focused image, and hence
we used this method to obtain images which are not focused properly. In particular,
each image is blurred two times, giving us three images with varying levels of focus.
The stronger the blurring, the worse the appearance of the image. This process
resulted in 150 images with varying levels of focus. Each of these images was resized to
120x80, resulting in a 9600-dimensional vector. Instead of representing
the image in such a crude form, we use Gabor 2D wavelets [8] to represent each
image by a 64-dimensional vector. The ranking algorithm discussed earlier is used with
an RBF (radial basis function) kernel to learn the ranking function. The learned function
is then evaluated on the remaining 120 images corresponding to 40 different scenes (three
images per scene). For each scene, the testing task is to pick the
clearest (best focused) image out of the three images. In 95% of the cases (that is, in all
but two cases), the original image (without Gaussian blurring) is picked as the best
focused image.

The real test of the algorithm is when it is applied to real data. To do this, we used a
digital camera (IBM Web camera) and captured 42 scenes with it. For each scene, three
images were taken by manually distorting the focus. Some of these images are shown in
Fig. 3. Note that the difference between the focused and unfocused images is very minor
(this makes the problem very hard). To test the algorithm, we used only a small amount of
real data for training (since we want to test on as much real data as possible):
5 real scenes together with all the synthetic data formed the training set. Again, all
the images were of size 120x80, resulting in a 9600-dimensional feature vector. As with
the synthetic data, Gabor signatures are used to represent each image as a 64-dimensional
vector. Again, we made use of our ranking algorithm with an RBF kernel to learn the
ranker. When tested on the remaining 37 scenes, the algorithm picked the correct image
in 35 cases and failed in 2. Some of the images ranked by our system
are presented in Fig. 3. The generalization capability of the algorithm
is evident from the fact that the ranker, which was learned with only a small
number of real training images (and synthetic images), did so well on the test data.



7 Summary 

In this paper, we have presented an algorithm to reduce both sorting and ranking
problems to a binary classification problem. We have also shown how one can make use of
kernels to solve these problems in higher dimensional spaces. Moreover, our reduction
of the ranking problem to a binary classification problem results in a straightforward
application of online/batch classification algorithms. The main contribution of the
paper is that training can be done with a linear amount of data using a novel formulation
of the ranking problem. Our results on synthetic and real data show that these algorithms
are viable and can be used in practice.

Fig. 3. Some synthetic (top row) and real (bottom row) images used to test our system.
Abstract. We present a new approach for exploration in Reinforcement Learning
(RL) based on certain properties of the Markov Decision Processes (MDP). Our
strategy facilitates a more uniform visitation of the state space, a more extensive
sampling of actions with potentially high variance of the action-value function
estimates, and encourages the RL agent to focus on states where it has most control
over the outcomes of its actions. Our exploration strategy can be used in combination
with other existing exploration techniques, and we experimentally demonstrate
that it can improve the performance of both undirected and directed exploration
methods. In contrast to other directed methods, the exploration-relevant
information can be precomputed and then used during learning without
additional computation cost.



1 Introduction 

One of the key features of reinforcement learning (RL) is that a learning agent is not
instructed what actions it should perform; instead, the agent has to evaluate all
available actions [13], and then decide for itself on the best way of behaving. This
creates the need for an RL agent to actively explore its environment, in order to discover
good behavior strategies. Ensuring an efficient exploration process and balancing the risk
of taking exploratory actions with the benefit of information gathering are of great
practical importance for RL agents, and have been the topic of much recent research, e.g.,
[14, 7, 2, 4, 12].

Existing exploration strategies can be divided into two broad classes: undirected
and directed methods. Undirected methods are concerned only with ensuring sufficient
exploration, by selecting all actions infinitely often. The ε-greedy and Boltzmann
exploration strategies are notable examples of such methods. Undirected methods are very
popular because of their simplicity, and because they do not have additional
requirements of storage or computation. However, they can be very inefficient for certain
domains. For example, in deterministic goal-directed tasks with a positive reward received
only upon entering the goal state, undirected exploration is exponential in the number
of steps needed for an optimal agent to reach the goal state [14]. On the other hand,
by using some information about the course of learning, the same tasks can be solved
in time polynomial in the number of states and maximum number of actions available
in each state [14]. The impact of exploration is believed to be even more important for
stochastic environments. Directed exploration strategies attempt not only to ensure a
sufficient amount of exploration, but also to make the exploration efficient, by using
additional information about the learning process. These techniques often aim to achieve a
more uniform exploration of the state space, or to balance the relative profit of
discovering new information versus exploiting current knowledge. Typically, directed
methods keep track of information regarding the learning process and/or learn a model of
the system. This requires extra computation and storage in addition to the resources needed
by general on-line RL algorithms, in order to make better exploration decisions. More
details on existing exploration methods are given in Section 2.2.

In this paper, we present a new directed exploration approach, which takes into
account the properties of the Markov Decision Process (MDP) being solved. In prior work
[9], we introduced several attributes that can be used to provide a quantitative
characterization of MDPs. Our approach to exploration is based on two of these attributes:
state transition entropy and forward controllability. The state transition entropy
provides a characterization of the amount of stochasticity in the environment. Forward
controllability measures how much the agent's actions actually impact the trajectories
that the agent follows. Our prior experimental results [9] suggest that these attributes
significantly affect the quality of learning for on-line RL algorithms with function
approximation, and that this effect is due in part to the amount of exploration in MDPs
with different characteristics. In this paper, we show how to use these MDP attributes
in combination with both undirected and directed existing exploration methods.

Using MDP attributes can improve the exploration process in three ways. First, it
encourages a more homogeneous visitation of the state space, similar to other existing
directed methods. Second, it encourages more frequent sampling of actions with
potentially high variance in their action-value estimates. Finally, it encourages the
learning agent to focus more on the states in which its actions have more impact. One
important difference between our exploration strategy and other directed techniques is that
the extra information we use reflects only properties of the task at hand, and does not
depend on the history of learning. Hence, this information does not carry the bias of
previous, possibly unfortunate exploration decisions. Additionally, in some cases the
MDP attributes can be precomputed and then used during learning without
any additional computational cost. The attributes' values can also be transferred
between tasks if the agent is faced with solving multiple related tasks in an environment
in which the dynamics do not change much. The attributes can also be estimated during
learning, which would require only a small constant amount of additional resources,
in contrast to most other directed methods.

The rest of the paper is organized as follows. In Section 2, we provide background 
on RL and existing exploration approaches. The details of the proposed exploration 
method are presented in Section 3. Empirical results are discussed in Section 4. The 
directions for future work are presented in Section 5. 

2 Background 

2.1 RL Framework 

We assume the standard RL framework, in which a learning agent is situated in a
dynamic stochastic environment and interacts with it at discrete time steps. The
environment assumes states from some state space $S$ and the agent chooses actions from
some action space $A$. On each time step, in response to the agent's actions, the
environment undergoes state transitions governed by a stationary probability distribution
$P^a_{ss'}$, where $s, s' \in S$, $a \in A$. At the same time, the agent receives a
numerical reward $R^a_{ss'}$ from the environment, which reflects the one-step desirability
of the agent's actions. State transitions and rewards are, in general, stochastic and
satisfy the Markov property: their distributions depend only on the current state of the
environment and the agent's current action, and are independent of the past history of
interaction. The goal of the agent is to adopt a policy (a way of choosing actions)
$\pi : S \times A \to [0,1]$ that optimizes a long-term performance criterion, called
return, which is usually expressed as a cumulative function of the rewards received on
successive time steps. Such a learning problem is called a Markov Decision Process (MDP).
Many RL algorithms estimate value functions, which can be viewed as utilities of states
and actions. Value functions are estimates of the expected returns and take into account
any uncertainty pertaining to the environment or the agent's action choices. For instance,
the action-value function associated with a policy $\pi$,
$Q^\pi : S \times A \to \mathbb{R}$, is defined as:

$$Q^\pi(s,a) = E_\pi\{ r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, a_t = a \}$$

where $\gamma \in (0,1]$ is the discount factor. The optimal action-value function, $Q^*$,
is defined as the action-value function of the best policy:
$Q^*(s,a) = \max_\pi Q^\pi(s,a)$. In this paper, we focus on RL algorithms that estimate
the optimal action-value function from samples obtained by interacting with the
environment.
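
As a small illustration of the return that the action-value function averages over, the sketch below computes a discounted sum of rewards for one sampled trajectory. The function name is hypothetical, and the reward list is assumed to hold the rewards received after taking the action of interest, in the order in which they were received.

```python
def discounted_return(rewards, gamma):
    """Discounted sum rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ...

    `rewards` holds the rewards received on successive time steps of one
    trajectory; `gamma` is the discount factor in (0, 1].
    """
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

Averaging such returns over many trajectories that start with the same state-action pair and then follow a fixed policy gives a sample-based estimate of that policy's action value for the pair.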

2.2 Exploration in RL 

The goal of an exploration policy is to allow the RL agent to gather experience with
the environment in such a way as to find the optimal policy as quickly as possible,
while also gathering as much reward as possible during learning. This goal can itself be
cast as a learning problem, often called optimal learning [6]. Solving this problem would
require the agent to have a probabilistic model of the uncertainty about its own
knowledge of the environment, and to update this model as learning progresses. Solving the
optimal learning problem then becomes equivalent to solving the partially observable
MDP (POMDP) defined by this model, which is generally intractable. However, various
heuristics can be used to decide which exploration policy to follow, based only on
certain aspects of the uncertainty about the agent's knowledge of the environment.

As discussed in Section 1, existing exploration techniques can be grouped in two
main categories: undirected and directed methods. Undirected methods ensure that each
action will be selected with non-zero probability in each visited state. For instance, the
ε-greedy exploration strategy selects the currently greedy action (the best according to
the current estimate of the optimal action-value function $Q(s,a)$), in any given state,
with probability $(1 - \varepsilon)$, and selects a uniformly random action with
probability $\varepsilon$. Another popular choice for undirected exploration, the
Boltzmann distribution, assigns the probability $\pi(s,a)$ of taking action $a$ in state
$s$ as

$$\pi(s,a) = \frac{e^{Q(s,a)/T}}{\sum_{b \in A} e^{Q(s,b)/T}}$$

where $T$ is a positive temperature parameter that decreases the amount of randomness as
it approaches zero. When using on-policy RL algorithms, such as SARSA [13], the
exploration rate ($\varepsilon$ in ε-greedy exploration and $T$ in Boltzmann exploration)
has to decrease to zero with time in an appropriate manner [11] in order to ensure
convergence to the optimal (deterministic) policy. In practice, however, constant
exploration rates are often used.
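
The two undirected strategies described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation; the function names are hypothetical, and action values for one state are assumed to be given as a list indexed by action.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, uniformly random otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probabilities(q_values, temperature):
    """Action probabilities proportional to exp(Q(s, a) / T)."""
    # Subtracting the maximum leaves the distribution unchanged and avoids overflow.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

As the temperature approaches zero, nearly all probability mass moves to the greedy action, which matches the role of $T$ described in the text.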

Directed exploration methods typically keep some information about the state of
knowledge of the agent, estimating certain aspects of its uncertainty. The action to be
taken is usually selected by maximizing an evaluation function that combines
action-values with some kind of exploration bonuses, $\delta_i$:

$$N(s,a) = K_0 Q(s,a) + K_1 \delta_1(s,a) + \cdots + K_k \delta_k(s,a)$$   (1)

Exploration is driven mainly by the exploration bonuses that change over time. The
positive constants $K_i$ control the exploration-exploitation balance.
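
A minimal sketch of action selection with the evaluation function in (1) follows. The function names and the data layout (one list of bonus values per action) are assumptions made for illustration; how each bonus is computed depends on the particular directed method.

```python
def evaluation(q_value, bonuses, k0, ks):
    """N(s, a) = K_0 * Q(s, a) + sum_i K_i * delta_i(s, a) for one action."""
    return k0 * q_value + sum(k * b for k, b in zip(ks, bonuses))


def select_directed_action(q_values, bonus_table, k0, ks):
    """Pick the action that maximizes the evaluation function.

    q_values[a] is the current action value of action a in the current state;
    bonus_table[a] lists the exploration bonuses delta_1..delta_k for action a.
    """
    scores = [
        evaluation(q_values[a], bonus_table[a], k0, ks)
        for a in range(len(q_values))
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

With all bonus weights set to zero the rule reduces to greedy selection, while a larger bonus weight lets a rarely tried action win over a slightly better-valued one.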

Directed exploration methods differ in the kind of exploration bonuses they define,
which reflect different heuristics regarding what states are important to revisit. For
example, counter-based methods [14] direct exploration toward the states that were visited
least frequently in the past. Recency-based exploration [14, 12] prefers instead the states
that were visited least recently. In both of these cases, the result is a more homogeneous
exploration of the state space. Error-based exploration [10] prefers actions leading to
states whose value changed most in past updates. Interval Estimation (IE) [3, 16], as
well as its global equivalent, IEQL+ [7], bias exploration toward actions that have the
highest variance in the action value samples. In the value of information strategy [1,
2], the exploration-exploitation tradeoff is solved with a myopic approximation of the
value of perfect information. The $E^3$ algorithm [4] learns a model of the MDP. Based
on the estimated accuracy of this model and a priori knowledge of the worst-case
mixing time of the MDP and the maximum attainable returns, $E^3$ explicitly balances the
profit of exploitation and the possibility of efficient exploration. Due to this balancing,
$E^3$ provably achieves near-optimal performance in polynomial time. However, there is
little practical experience available with this algorithm.

3 Using MDP Attributes for Exploration 

Similarly to many directed exploration methods, the goal of our approach is to ensure
a more uniform visitation of the state space, while also quickly gathering the samples
most needed to estimate the action-value function well. In order to achieve this goal,
we focus on using two attributes that can be used to characterize MDPs: state transition
entropy (STE) and forward controllability (FC). In prior work [9], we found that these
attributes had a significant effect on the speed of learning and quality of the solution
found by on-line RL algorithms. This effect seemed to be due mostly to their influence
on the RL agent's exploration of the state space. Both attributes can be computed for
each state-action pair $(s,a)$ based on the MDP model (if it is known), or they can be
estimated based on sample transitions. The basic idea of our strategy is to favor
exploratory actions which exhibit high values of STE, FC, or both of these features. We
will now explain the details of our approach.

State transition entropy (STE) measures the amount of stochasticity due to the
environment's state dynamics. Let $O_{s,a} \in S$ denote a random variable representing the
outcome (next state) of the transition from state $s$ when the agent performs action $a$.
Using the standard information-theoretic definition of entropy, STE for a state-action
pair $(s,a)$ can be computed as follows [5]:

$$STE(s,a) = H(O_{s,a}) = -\sum_{s' \in S} P^a_{ss'} \log P^a_{ss'}$$   (2)

A high value of $STE(s,a)$ means that there are many possible next states $s'$ (with
$P^a_{ss'} \neq 0$) which occur with similar probabilities. If in some state $s$, actions
$a_1$ and $a_2$ are such that $STE(s,a_1) > STE(s,a_2)$, the agent is more likely to
encounter more different states by taking action $a_1$ than by taking action $a_2$. This
means that giving preference to actions with higher STE could achieve a more homogeneous
exploration of the state space. Empirical evidence that a homogeneous visitation of the
state space can be helpful is presented in [13], where the performance of Q-learning with
an ε-greedy behavior policy is compared with the performance of Q-learning performed by
picking states uniformly randomly. The experiments were performed on discrete random MDPs
with different branching factors. Note that a large branching factor means a high STE value
for all states. In these tasks, the ε-greedy on-policy updates resulted in better solutions
and faster learning mainly for the deterministic tasks (with branching factor 1). As the
branching factor (and thus STE) increased, performing action-value updates uniformly
across the state space led to better solutions in the long run, and to better learning speed.
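
When the transition model is known, the state transition entropy of a state-action pair is the entropy of its next-state distribution. The sketch below assumes that this distribution is given as a list of probabilities; the function name and the `base` parameter are hypothetical, and the natural logarithm is used by default because the text does not state the base.

```python
import math


def state_transition_entropy(next_state_probs, base=math.e):
    """Entropy of the next-state distribution of one state-action pair.

    next_state_probs[j] is the probability of moving to state j. Terms with
    zero probability contribute nothing, following the usual convention
    that 0 * log(0) = 0.
    """
    return -sum(p * math.log(p, base) for p in next_state_probs if p > 0)
```

A deterministic action has entropy zero, while an action with a uniform distribution over b next states has entropy log(b), which is why a large branching factor goes together with a high STE value.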

Another potential consequence of a high value of $STE(s,a)$ is a large variance of the
action-value estimates for $(s,a)$. In on-policy learning methods, such as SARSA [13],
the action value of a state-action pair $(s,a)$ is updated toward a target estimate
obtained after taking action $a$:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \underbrace{[R^a_{ss'} + \gamma Q(s',a')]}_{\text{Target for } (s,a)}, \quad \alpha \in (0,1)$$

These target estimates are drawn according to the probability distribution of the next
state $s'$. If $STE(s,a)$ is high, there will be many possible next states, and
consequently the variance in the target estimates could be higher. In prior experiments
using the SARSA(0) learning algorithm with linear function approximation [9], we observed
that in environments with high STE values, there was a trade-off in the quality of the
approximation achieved between the positive effect of "natural" exploration and the
negative effect of high variance in the target action-value estimates used by the
algorithm. In order to get a good estimate of $Q(s,a)$ when the target values have high
variance, more samples are needed. By encouraging the exploration of actions with high STE
values, our strategy collects more samples for exactly these state-action pairs. This idea
is reminiscent of the IE directed exploration method [3], but we do not rely on explicitly
estimating the variance of the action value samples, which would be much more expensive in
terms of both storage and computation.
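
The on-policy update discussed above can be sketched for the tabular case as follows. The function name and the dictionary-based table are assumptions made for illustration; the update moves the current estimate toward the target with step size alpha.

```python
from collections import defaultdict


def sarsa_update(q_table, s, a, reward, s_next, a_next, alpha, gamma):
    """One tabular SARSA(0) update; returns the target that was used.

    q_table maps (state, action) pairs to action values.
    """
    # Target: observed reward plus the discounted value of the next pair.
    target = reward + gamma * q_table[(s_next, a_next)]
    q_table[(s, a)] = (1 - alpha) * q_table[(s, a)] + alpha * target
    return target
```

Because the next state is sampled from the transition distribution, repeated calls for the same state-action pair see different targets, and the spread of those targets grows with the number of distinct next states.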

The controllability of a state $s$ is a normalized measure of the information gain when
predicting the next state based on knowledge of the action taken, as opposed to making
the prediction before an action is chosen (a similar, but not identical, attribute is used
by Kirman [5]). Let $O_s \in S$ denote a random variable representing the outcome of a
uniformly random action in state $s$. Let $A_s$ denote a random variable
representing the action taken in state $s$. We consider $A_s$ to be chosen from a uniform
distribution. Given the value of $A_s$, the information gain is the reduction in the
entropy of $O_s$: $H(O_s) - H(O_s \mid A_s)$, where

$$H(O_s) = -\sum_{s' \in S} \left( \frac{\sum_{a \in A} P^a_{ss'}}{|A|} \right) \log \left( \frac{\sum_{a \in A} P^a_{ss'}}{|A|} \right)$$

$$H(O_s \mid A_s) = -\frac{1}{|A|} \sum_{a \in A} \sum_{s' \in S} P^a_{ss'} \log(P^a_{ss'})$$



The controllability in state $s$ is defined as:

$$C(s) = \frac{H(O_s) - H(O_s \mid A_s)}{H(O_s)}$$   (3)



If all actions are deterministic, then $H(O_s \mid A_s) = 0$ and $C(s) = 1$. If
$H(O_s) = 0$ (all actions deterministically lead to the same state), then $C(s)$ is
defined to be 0. The forward controllability (FC) of a state-action pair is the expected
controllability of the next state:

$$FC(s,a) = \sum_{s' \in S} P^a_{ss'} C(s')$$   (4)



Favoring actions with high FC will direct an RL agent toward states in which it
has a lot of control over the next state transitions, by making appropriate action choices.
Having such control enables the agent to reap higher returns in environments where
some trajectories are more profitable than others, as shown in our prior experiments [9].
At the same time, actions with high FC lead to states in which different actions have
very different outcomes. Hence, from such states, the agent is likely to explore the state
space more uniformly. A third reason to favor actions with high values of $FC(s,a)$ is
that, similarly to the case of high STE, such actions can potentially have high variance
in the targets used to update their action values, $Q(s,a)$. If a resulting state, $s'$,
is highly controllable, the actions $a'$ available there could lead to very different next
states, and hence $Q(s',a')$ is likely to have high variance. As a result, gathering more
samples from $(s,a)$ should increase the speed of learning.
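
The definitions of controllability and forward controllability can be sketched for a known transition model. The nested-list layout `P[s][a][s2]` (probability of moving from state s to state s2 under action a) and the function names are assumptions made for illustration; actions are taken to be uniformly distributed, as in the text.

```python
import math


def entropy(probs):
    """Entropy of a discrete distribution, with 0 * log(0) taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def controllability(P, s):
    """Normalized information gain about the next state from knowing the action."""
    actions = P[s]
    n_actions = len(actions)
    n_states = len(actions[0])
    # Next-state distribution under a uniformly random action.
    marginal = [
        sum(actions[a][s2] for a in range(n_actions)) / n_actions
        for s2 in range(n_states)
    ]
    h_outcome = entropy(marginal)
    if h_outcome == 0:
        # All actions lead deterministically to the same state.
        return 0.0
    # Conditional entropy: average entropy of the per-action distributions.
    h_given_action = sum(entropy(actions[a]) for a in range(n_actions)) / n_actions
    return (h_outcome - h_given_action) / h_outcome


def forward_controllability(P, s, a):
    """Expected controllability of the state reached from s by action a."""
    return sum(p * controllability(P, s2) for s2, p in enumerate(P[s][a]) if p > 0)
```

In a two-state example where state 0 has two deterministic actions with different outcomes and state 1 has two actions with identical uniform outcomes, the controllability is 1 for state 0 and 0 for state 1.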

The idea of guiding exploration based on the values of the STE and FC attributes
can easily be incorporated in both undirected and directed exploration techniques. For
instance, consider the case of the ε-greedy exploration strategy. The greedy action is still
chosen with probability $(1 - \varepsilon)$. When a choice to explore is made (with
probability $\varepsilon$), the exploratory action is selected according to a Boltzmann
distribution:

$$\pi(s,a) = \frac{e^{(K_1 \cdot STE(s,a) + K_2 \cdot FC(s,a))/T}}{\sum_{b \in A} e^{(K_1 \cdot STE(s,b) + K_2 \cdot FC(s,b))/T}}$$   (5)

where $T$ is the temperature parameter. The nonnegative constants $K_1$ and $K_2$ can be
used to adjust the relative contribution of each term. Of course, STE and FC can be used
with probability distributions other than Boltzmann as well.
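
The attribute-based variant of ε-greedy exploration described above can be sketched as follows. The function names are hypothetical, and the attribute values for the current state are assumed to be given as lists indexed by action.

```python
import math
import random


def attribute_exploration_probs(ste, fc, k1, k2, temperature):
    """Boltzmann distribution over actions based on K1 * STE + K2 * FC."""
    prefs = [(k1 * s + k2 * f) / temperature for s, f in zip(ste, fc)]
    # Subtracting the maximum leaves the distribution unchanged and avoids overflow.
    top = max(prefs)
    weights = [math.exp(p - top) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def attribute_epsilon_greedy(q_values, ste, fc, epsilon,
                             k1=1.0, k2=1.0, temperature=1.0, rng=random):
    """Greedy with probability 1 - epsilon; otherwise sample by attributes."""
    actions = range(len(q_values))
    if rng.random() < epsilon:
        probs = attribute_exploration_probs(ste, fc, k1, k2, temperature)
        return rng.choices(actions, weights=probs, k=1)[0]
    return max(actions, key=lambda a: q_values[a])
```

Setting both constants to zero makes the exploratory choice uniform, so the rule then coincides with plain ε-greedy exploration.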

In directed exploration, the STE and FC attributes can be used as additional
exploration bonuses, and hence can be easily incorporated in most existing methods. In this
case, the behavior policy deterministically picks the action maximizing the function:

$$N(s,a) = K_0 Q(s,a) + K_1 STE(s,a) + K_2 FC(s,a) + \sum_{j=3}^{k} K_j \delta_j(s,a)$$   (6)

where $\delta_j(s,a)$ can be any exploration bonuses based on data about the learning
process, such as counter-based, recency-based, error-based or IE-based bonuses. In this
case, the trade-off between exploitation and exploration can be controlled by tuning the
parameters $K_i$ associated with each term.

Note that our exploration approach uses only characteristics of the environment,
which are independent of the learning process. Thus, the information needed can be
gathered prior to learning. This can be done if the transition model is known, or if the
agent has access to a simulator, with which it can interact to estimate the attributes
from sampled state transitions*. Also, the attributes' values can be carried over if the
task changes slightly (e.g., in the case of goal-directed tasks in which the goal location
moves over time). Alternatively, the attributes can be computed during learning based
on observed state transitions. This can be done efficiently by incremental methods for
entropy estimation [15] and mean estimation with a forgetting factor for FC. In this
case, only a small constant amount of extra computation per time step is needed. This
is in contrast to most other directed exploration methods, which not only rely on
estimation of transition probabilities, but also require more computation to re-evaluate
their exploration-relevant information on every time step, e.g., [14, 4, 12, 3, 16, 2]. At
the same time, the exploration-relevant information based on the learning history used
in other directed techniques can carry the bias of previous (possibly unsuccessful)
exploration decisions and value estimates.
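
One simple way to estimate STE from sampled transitions is a plug-in estimate based on transition counts, sketched below. This is not the incremental estimator of [15]; the class and method names are hypothetical, and the sketch only illustrates the idea of estimating the attribute from a simulator or from observed transitions.

```python
import math
from collections import defaultdict


class TransitionCounts:
    """Counts of observed transitions, used for plug-in STE estimates."""

    def __init__(self):
        # counts[(s, a)][s_next] = number of observed transitions.
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, s_next):
        """Record one sampled transition (s, a) -> s_next."""
        self.counts[(s, a)][s_next] += 1

    def estimated_ste(self, s, a):
        """Entropy of the empirical next-state distribution of (s, a)."""
        outcomes = self.counts.get((s, a))
        if not outcomes:
            return 0.0
        total = sum(outcomes.values())
        return -sum((c / total) * math.log(c / total) for c in outcomes.values())
```

With few samples the plug-in estimate tends to underestimate the true entropy, so a pair should be sampled several times before its estimate is used to guide exploration.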

4 Experimental Results 

In order to assess empirically the merit of using STE and FC as heuristics for guiding
exploration, we experimented with using these attributes together with ε-greedy
exploration (as a representative of undirected methods) and recency-based exploration (as a
representative of directed methods). We chose recency-based exploration among the
directed exploration techniques because in previous experiments [14] it compared
favorably to other directed methods, while being less sensitive to the tuning of its
parameters. At the same time, this method is conceptually close to attribute-based
exploration, in that it encourages a homogeneous exploration of the state space. Hence, it
is interesting to see whether the use of MDP attributes can give any additional benefit in
this case.

The attributes were incorporated into the ε-greedy strategy as shown in (5). We used
parameter settings $K_1, K_2 \in \{0, 1\}$, $T = 1$ and
$\varepsilon \in \{0.1, 0.4, 0.9\}$. The recency-based technique was combined with the
attributes based on the idea of additive exploration bonuses, as shown in (6), where we
used one recency-based exploration bonus, $\delta(s,a)$. As before, we used
$K_1, K_2 \in \{0, 1\}$. The constant corresponding to the value function was set to
$K_0 \in \{1, 10, 50\}$ and the constant corresponding to the recency bonus was
$K_3 = 1$. The learning algorithm used was tabular SARSA(0) with a decreasing learning
rate $\alpha(s_t, a_t)$ determined by $n(s_t, a_t)$, the number of visits to the
state-action pair $(s_t, a_t)$ at time $t$. The action values $Q(s,a)$ were initialized to
zero at the beginning of learning.

* Note that even if the MDP model is known, it is often not feasible to apply dynamic
programming methods and the issue of efficient exploration is still important. As
suggested in [17], model-based exploration methods are in fact superior to model-free
methods in many cases.

Fig. 1. Performance of ε-greedy exploration (pure and attribute-based) for low-STE (top) and
high-STE tasks (bottom)



The experiments were conducted on randomly generated MDPs with 225 states and 
3 actions available in every state. The branching factor for these MDPs varied randomly 
between 1 and 20 across the states and actions. Transition probabilities and rewards 
were also randomly generated, with rewards drawn uniformly from [0, 1]. At each state, 
there was a 0.01 probability of terminating the episode. These random MDPs were 
divided in four groups of five tasks each. Two of the groups contained MDPs with 
’’low” average STE values {3iVg{STE{s,a)) < 1.7), and the other two groups contained 
MDPs with “high” STE values (avg{STE{s^a)) G [1.7, 2. 7]). This grouping allowed us 
to investigate whether the overall amount of stochasticity in the environment influences 
the effect of the attributes on exploration. The two groups on each STE level differed in 
that one of them (which we will call the test group) had a large variation in the attribute 
values for different actions, while the other one (the control group) had similar values 
of the attributes across all states and actions. In the control groups, we would expect to 
see no effect of using the attributes, because the exploration decisions at all states and 
actions should be mostly unaffected by the attribute values. Hence, we use the control
groups to test the possibility of observing any effect “by chance”. The experimental
results presented below are for the case where the attributes were precomputed from
simulation of the MDPs prior to learning. Preliminary experiments, in which the attributes
were computed during learning, indicate qualitatively similar results.

We use two measures of performance for the exploration algorithms under consid- 
eration. The first measure is an estimate of the return of the greedy policy produced 
by the algorithms at different points in time. After every 50 trials, we take the greedy 
policy with respect to the current action values and we run this policy from 50 fixed 
test states, uniformly sampled from the state space. We run 30 trials from each such 
state, then we average these returns over the trials and over the 50 states. Because we 
are using different tasks, with different optimal value functions (and hence different up- 
per bounds on the performance that can be achieved), it is difficult to compare greedy 
returns directly, without any normalization. Hence, we normalize the average greedy 
return by the average return of the uniformly random policy from the same 50 states 
(computed over 30 trials). In our prior experiments [9] we found that this normalization 
yields very similar results to normalizing by the return of the optimal policy. Of course, 
using the optimal policy would generally give the best normalization, but the optimal 
policy cannot always be computed by independent means. 

The second performance measure that we use is aimed at providing a quantitative 
measure of both the speed of learning and the quality of the solution obtained. It is often 
difficult to compare different algorithms in terms of both of these measures, because one 
algorithm may have a steeper learning curve, but a more erratic (or worse) performance 
in the long run. In order to assess these kinds of differences, we use the following 
penalty measure for each run: 

$$P = \frac{1}{T}\sum_{t=1}^{T} t\,\left(R_{max} - R^{t}\right), \qquad (7)$$

where $R_{max}$ is an upper limit of the (normalized) return of the optimal policy^, $R^{t}$ is the
(normalized) greedy return after trial $t$ and $T$ is the number of trials. In this way, failure
to achieve the best known performance is penalized more after more learning trials have 
occurred. This measure gives a lower penalty to methods that achieve good solutions 
earlier and do not deviate from them. In our experiments, we compute one penalty for 
every independent run of every algorithm (which can be viewed as a “summary” for the 
run). 
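
The penalty computation can be illustrated with a minimal Python sketch. It assumes the penalty is the trial-weighted sum of shortfalls from $R_{max}$ divided by the number of trials $T$; the $1/T$ normalization, the function name `penalty`, and the two example learning curves are assumptions made for illustration only.

```python
from typing import Sequence


def penalty(greedy_returns: Sequence[float], r_max: float) -> float:
    """Trial-weighted shortfall: (1/T) * sum over t = 1..T of t * (r_max - R^t).

    greedy_returns[t - 1] is the normalized greedy return after trial t.
    The 1/T normalization is an assumption of this sketch.
    """
    T = len(greedy_returns)
    if T == 0:
        raise ValueError("need at least one evaluated trial")
    weighted = sum(t * (r_max - r) for t, r in enumerate(greedy_returns, start=1))
    return weighted / T


# Two hypothetical runs that end at the same return but reach it at different times.
fast = [0.5, 1.0, 1.0, 1.0]   # reaches r_max after the second evaluation
slow = [0.5, 0.5, 0.5, 1.0]   # reaches r_max only at the end
print(penalty(fast, r_max=1.0), penalty(slow, r_max=1.0))
```

Because the shortfall at trial $t$ is multiplied by $t$, the run that stays below $R_{max}$ for longer receives the larger penalty even though both runs end at the same return.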

The results of the experiments are presented in Figure 1, for the ε-greedy strategy,
and in Figure 2, for recency-based exploration. The performance measures are com- 
puted in terms of the normalized greedy returns, averaged over the 5 MDPs in each 
group and over 30 runs for each MDP. The left panels represent learning curves for the 
normalized greedy returns, while the right panels represent the average penalty mea- 
sure over the runs, computed using (7). Light lower portions of the bars represent mean 
penalty, and they are topped with standard deviation (dark portions). 

We also performed statistical tests to verify whether the observed performance dif- 
ferences are statistically significant. Because we are interested in both the asymptotic 

^ This limit can be either known or estimated as a maximum return ever observed for a task. 

Fig. 2. Performance of pure and attribute-based recency exploration for the low-STE (top) and
high-STE tasks (bottom)



performance and the speed of learning, we used a randomized ANOVA procedure [8] 
to compare the learning curves of the different algorithms. This procedure is more ap- 
propriate than the conventional one for comparing learning curves, because it does not 
rely on the assumption of homogeneity of co-variance, which is violated when there are 
carry-over effects. We performed the analysis separately on the learning curves for each 
task. We also performed two-way ANOVA of the penalty measure averaged over the 5 
MDPs in each group. In this case, one factor was the tunable parameter for the “pure” 
exploration strategy (ε for the ε-greedy and K for the recency-based) and the other factor
was the variant of the corresponding strategy (pure vs. using the attribute(s)). 

As shown in Figure 1, incorporating the attributes into the ε-greedy strategy has a
positive effect both for the low-STE and for the high-STE test groups, in all cases except ε = 0.1
in high-STE environments. The randomized ANOVA test for learning curves showed a
difference in the performance between the pure strategy and each of the three attribute-
based variants at the level of significance no smaller than p = 0.008 for each task and
for each setting of ε. The penalty measure graphs show that the positive effect of using
STE becomes more significant as ε increases, and this trend is especially pronounced in
the case of the high-STE tasks. In this case, most estimates $Q(s,a)$ have high variance,
but with a small exploration rate, many actions are not sufficiently sampled. Using STE
allows more samples to be gathered for such actions, and hence improves the solution
quality. FC has a greater positive effect for the high-STE tasks, as can be seen from
the penalty graphs in the right panels of Fig. 1, mainly because it improves the speed
of learning (the learning curves are not shown here, due to lack of space). This shows
that encouraging the agent to learn about states where it can better control the course of
state transitions is helpful, especially given a background of high overall stochasticity.
The two-way ANOVA test on the penalty measure showed that the positive effect of
using the attributes is significant.

Figure 2 presents the same performance measures when incorporating the attributes
in the recency-based exploration strategy. The recency-based method is significantly
more robust to the tuning of its main parameter, $K_0$, than the ε-greedy strategy is to
the tuning of ε. With all the settings of $K_0$, the performance of this strategy is close
to the best performance of the ε-greedy strategy (obtained at ε = 0.9). However, using
the MDP attributes further improves performance of the recency-based method as well,
although the effects appear to be smaller than in the case of the ε-greedy method (we
believe this is due to a ceiling effect). Although the differences appear to be small, the
statistical tests show that most differences are significant. In particular, the randomized
ANOVA test shows a significant difference between learning curves in the low-STE
group at the level no smaller than p = 0.04 for all tasks and all attribute versions. For
the high-STE tasks, significance levels range from p = 0.04 for the version using only
FC to p = 0.226 for the version using only STE. The two-way ANOVA on the penalty
measure is also less significant for the recency-based strategy in the high-STE group
(p = 0.03 for comparison of the pure vs. FC-based variant and p = 0.11 for pure vs.
STE-based variant). Similar to the case of the ε-greedy strategy, FC appears to have a
greater positive effect for the high-STE tasks.

For both the ε-greedy and recency-based strategies, in most cases, using STE and
FC together produces an improvement which is very similar to the best improvement
obtained by using either one of the attributes in isolation. For the low-STE test group,
the STE attribute brings a bigger performance improvement, whereas for the high-STE
test group, FC has a bigger effect. Thus, it would be reasonable to always use the
combination of the two attributes to achieve the best improvement. Note that the improvements
were obtained without tuning any additional parameters, both for the ε-greedy and the
recency-based methods.

The results of the experiments conducted on the control groups did not reveal any 
effect of using the attributes with either the ε-greedy or recency-based exploration. This
reinforces our conclusion that the effects observed on the test groups are not spurious. 



5 Conclusions and Future Work 

In this paper, we introduced a novel exploration approach based on the use of specific 
MDP characteristics. Exploration decisions are made independently of the course of 
learning so far, based only on properties of the environment. Our technique facilitates a
more homogeneous exploration of the state space and a more extensive sampling of actions
with a potentially high variance of the action-value function estimates, and encourages
the agent to focus on states where it has most control over the outcomes of its actions.
In our experiments, using these attributes improved performance for both undirected
(ε-greedy) and directed (recency-based) exploration in a statistically significant way. The
improvements were obtained without tuning any additional parameters. The attribute 
values can be pre-computed before the learning starts, or they can be estimated during 
learning. In the latter case, the amount of additional storage and computation is much 
less compared to other directed techniques. 

We are currently conducting a more detailed empirical study using toy hand-crafted 
MDPs in order to better understand the circumstances under which the use of MDP 
attributes to guide exploration is most beneficial. We also plan to investigate the use of 
other attributes, e.g. the risk of taking exploratory actions and variance of immediate 
rewards. 
Abstract. Many machine learning tasks contain feature evaluation as
one of their important components. This work is concerned with attribute
estimation in problems where the class distribution is unbalanced or
the misclassification costs are unequal. We test some common attribute 
evaluation heuristics and propose their cost-sensitive adaptations. The 
new measures are tested on problems which can reveal their strengths 
and weaknesses. 



1 Introduction 

Feature (attribute) evaluation is an important component of many machine 
learning tasks, e.g., feature subset selection, constructive induction, decision and 
regression tree learning. In feature subset selection we need a reliable and prac- 
tically efficient method for estimating the relevance of the features to the target 
concept, so that we can tackle learning problems where hundreds or thousands of 
potentially useful features describe each input object. In constructive induction
we try to enhance the power of the representation language and therefore
introduce new features. Typically many candidate features are generated and 
again we have to evaluate them in order to decide which to retain and which to 
discard. While constructing a decision or regression tree the learning algorithm 
at each interior node selects the splitting rule (feature) which divides the prob- 
lem space into subspaces. To select an appropriate splitting rule the learning 
algorithm has to evaluate several possibilities and decide which would partition 
the given problem most appropriately. Feature rankings and numerical estimates 
provided by evaluation algorithms are also an important source of information 
for a human understanding of certain tasks. 

While historically the majority of machine learning research has been
focused on reducing the classification error, there also exists a corpus of work on
cost-sensitive classification where not all errors are equally important (see the
online bibliography [13]). In general, differences in the importance of errors are handled
through the cost of misclassification.

This work is concerned with cost-sensitive attribute estimation, and we
assume that costs can be presented with the cost matrix $C$, where $C(i,j)$ is the
cost (could also be benefit) associated with the prediction that an example belongs
to the class $\tau_j$ where in fact it belongs to the class $\tau_i$. The optimal prediction
for an example $x$ is the class $\tau_i$ that minimizes the expected loss:

$$L(x, \tau_i) = \sum_{j=1}^{c} P(\tau_j \mid x)\, C(j,i),$$

where $P(\tau_j \mid x)$ is the probability of the class $\tau_j$ given example $x$. The task of a
learner is therefore to estimate these conditional probabilities. A feature evaluation
measure need not be cost-sensitive for decision tree building, as shown by [1,3,4].
However, cost-sensitivity is a desired property of an algorithm which tries to
rank or weight features according to their importance. Such a ranking can be
used for feature selection and feature weighting, or shown to human experts to
confirm or expand their domain knowledge. This is especially important in
fields like medicine, where experts possess a great deal of intuitive knowledge.
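
The expected-loss rule above can be illustrated with a short Python sketch. It assumes the indexing convention stated in the text, where `cost[j][i]` is the cost of predicting class `i` when the true class is `j`; the function names, the cost matrix, and the posterior probabilities are hypothetical values chosen for illustration.

```python
def expected_loss(posteriors, cost, predicted):
    """Expected loss of predicting class `predicted`.

    posteriors[j] is P(class j | x); cost[j][predicted] is the cost of
    predicting `predicted` when the true class is j.
    """
    return sum(p * cost[j][predicted] for j, p in enumerate(posteriors))


def optimal_class(posteriors, cost):
    """Class index with the smallest expected loss."""
    return min(range(len(posteriors)),
               key=lambda i: expected_loss(posteriors, cost, i))


# Hypothetical two-class cost matrix: missing class 1 is 20 times as costly
# as a false alarm. Rows are true classes, columns are predicted classes.
cost = [[0, 1],
        [20, 0]]
posteriors = [0.9, 0.1]   # the example is probably class 0 ...
print(optimal_class(posteriors, cost))   # ... yet predicting class 1 has lower expected loss
```

With these numbers, predicting class 0 has an expected loss of about 2.0 and predicting class 1 an expected loss of 0.9, so the cost-sensitive prediction differs from the most probable class.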

We will investigate some properties of attribute evaluation measures, such as
how they behave on imbalanced data sets, how they scale with an increasing number of
classes, whether they detect (conditional) dependencies between attributes, and
to what extent they are cost-sensitive. We propose several cost-sensitive variants
of common attribute evaluation measures and test them on artificial data sets 
which can reveal their properties. 

Throughout the paper we use the notation where each learning instance
$I_1, I_2, \ldots, I_n$ is represented by an ordered pair $(x_k, \tau)$, where each vector of
attributes $x_k$ consists of individual attributes $A_i$, $i = 1, \ldots, a$ ($a$ is the number of
attributes) and is labeled with the target value $\tau_j$, $j = 1, \ldots, c$ ($c$ is the number
of class values). Each discrete attribute $A_i$ has values $a_1$ through $a_{m_i}$ ($m_i$ is the
number of values of attribute $A_i$). Notation $I_{i,j}$ presents the value of the $j$-th attribute
for the instance $I_i$, and $I_{i,\tau}$ presents its class value. We write $p(a_{i,k})$ for the
probability that the attribute $A_i$ has value $a_k$, $p(\tau_k)$ is the probability of the
class $\tau_k$, and $p(\tau_j \mid a_{i,k})$ is the probability of the
class $\tau_j$ conditioned by the attribute $A_i$ having the value $a_k$.

The paper is organized into 6 sections. In Section 2 we review some selected
attribute evaluation measures and in Section 3 we test how imbalanced class 
distribution affects their performance. In Section 4 we describe how to extend 
these measures to use the information from the cost matrix and in Section 5 we 
evaluate the proposed extensions. Section 6 concludes the work. 

2 Attribute Evaluation Measures 

The problem of attribute estimation has received much attention in the litera- 
ture. There are several measures for estimating attributes’ quality. In classifica- 
tion problems these are e.g., Gini index [1], Gain ratio [11], Relief [5], ReliefF 
[6], MDL [7], and DKM [2]. 

Except for Relief and ReliefF, all these attribute evaluation measures are impurity
based, meaning that they measure the impurity of the class value distribution.
They assume the conditional (upon the class) independence of the attributes,
evaluate each attribute separately, and do not take the context of other attributes
into account. In problems which possibly involve many feature interactions these
measures are not appropriate. Relief and ReliefF do not make this assumption
and can correctly evaluate attributes in problems with strong dependencies
between the attributes. We will first present measures based on impurity, followed
by ReliefF.



2.1 Impurity Based Measures 

These measures evaluate each attribute separately by measuring impurity of the 
splits resulting from partition of the learning instances according to the values 
of the evaluated attribute. The general form of all impurity based measures is: 

$$M(A_i) = i(\tau) - \sum_{j=1}^{m_i} p(a_{i,j})\, i(\tau \mid a_{i,j}),$$

where $i(\tau)$ is the impurity of class values before the split, $m_i$ is the number of values
of attribute $A_i$, and $i(\tau \mid a_{i,j})$ is the impurity of class values after the split on
$A_i = a_j$. By subtracting the weighted
impurity of the splits from the impurity of unpartitioned instances we measure the
gain in the purity of class values resulting from the split. Larger values of $M(A_i)$
imply pure splits and therefore good attributes. We cannot directly apply these
measures to numerical attributes, but we can use any of a number of discretization
techniques first and then evaluate the discretized attributes. We consider three
measures as examples of impurity based attribute evaluation. 



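Gain Ratio [11] is implemented in the C4.5 program and is the most often used
impurity based measure. It is defined as

$$GR(A_i) = \frac{\sum_{k=1}^{c} p(\tau_k) \log p(\tau_k) - \sum_{j=1}^{m_i} p(a_{i,j}) \sum_{k=1}^{c} p(\tau_k \mid a_{i,j}) \log p(\tau_k \mid a_{i,j})}{\sum_{j=1}^{m_i} p(a_{i,j}) \log p(a_{i,j})} \qquad (1)$$
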

Its gain part tries to maximize the difference of entropy (which serves as impurity 
function) before and after the split. To prevent excessive bias towards multiple 
small splits the gain is normalized with the attribute’s entropy. 



DKM [2] has the following form of impurity function:

$$i(\tau) = 2\sqrt{p(\tau_{max})\,(1 - p(\tau_{max}))}, \quad \text{where } p(\tau_{max}) = \max_{i=1}^{c} p(\tau_i) \qquad (2)$$

is the probability of the most probable class value (the one which labels the split). Drummond
and Holte [3] have shown that for binary attributes this function is invariant to
changes in the proportion of different classes, i.e., it is cost-insensitive.
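
The general impurity form, with entropy and DKM as impurity functions, can be sketched in Python as follows. The sketch works on a joint probability table rather than on counts, and the function names and the example table are assumptions made for illustration. The example describes a hypothetical binary attribute in a uniform two-class problem that takes value 1 in 90% of the cases of the first class and is a fair coin otherwise.

```python
import math


def entropy(probs):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def dkm(probs):
    """DKM impurity: 2 * sqrt(p_max * (1 - p_max))."""
    p_max = max(probs)
    return 2.0 * math.sqrt(p_max * (1.0 - p_max))


def impurity_gain(joint, impurity):
    """General form: i(class) - sum_j p(a_j) * i(class | a_j).

    joint[k][j] is the joint probability of class k and attribute value j.
    """
    gain = impurity([sum(row) for row in joint])
    for j in range(len(joint[0])):
        p_value = sum(row[j] for row in joint)
        if p_value > 0:
            gain -= p_value * impurity([row[j] / p_value for row in joint])
    return gain


def gain_ratio(joint):
    """Entropy gain divided by the attribute's own entropy.

    Undefined (division by zero) for an attribute with a single value.
    """
    value_probs = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    return impurity_gain(joint, entropy) / entropy(value_probs)


# Hypothetical attribute: columns are attribute values 0 and 1, rows are classes.
joint = [[0.05, 0.45],   # first class: value 1 in 90% of its cases
         [0.25, 0.25]]   # second class: value assigned by a fair coin
print(round(impurity_gain(joint, dkm), 3), round(gain_ratio(joint), 3))
```

Both measures are computed here from exact probabilities, so they differ slightly from estimates obtained on a finite sample.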
MDL is based on the Minimum Description Length principle and measures the
quality of attributes as their ability to compress the data. The difference in
coding length before and after the value of the attribute is revealed corresponds
to the difference in impurity. Kononenko [7] has shown empirically that this
criterion has the most appropriate bias concerning multi-valued attributes among
a number of other impurity-based measures. It is defined as:

$$MDL(A_i) = \frac{1}{n}\left[\log_2\binom{n}{n_{1.},\ldots,n_{c.}} - \sum_{j=1}^{m_i}\log_2\binom{n_{.j}}{n_{1j},\ldots,n_{cj}} + \log_2\binom{n+c-1}{c-1} - \sum_{j=1}^{m_i}\log_2\binom{n_{.j}+c-1}{c-1}\right] \qquad (3)$$

Here $n$ is the number of training instances, $n_{i.}$ the number of training instances
from class $i$, $n_{.j}$ the number of instances with the $j$-th value of the given attribute, and
$n_{ij}$ the number of instances from class $i$ with the $j$-th value of the attribute.



2.2 ReliefF 



The ReliefF algorithm [6,12] is an extension of Relief [5]. Unlike Relief it is not
limited to two-class problems, is more robust, and can deal with incomplete and
noisy data. The idea of Relief and ReliefF is to evaluate the partitioning power of
attributes according to how well their values distinguish between similar instances.
An attribute is given a high score if its values separate similar observations with
different class values and do not separate similar instances with the same class value.
ReliefF samples the instance space, computes the differences between predictions
and values of the attributes, and forms a statistical measure for the proximity of
the probability densities of the attribute and the class. Assigned quality
evaluations are in the range [−1, 1].

Pseudo code of the algorithm is given in Figure 1. ReliefF randomly selects
an instance $R_i$ (line 3), and then searches for $k$ of its nearest neighbors from the
same class, called nearest hits $H$ (line 4), and also $k$ nearest neighbors from each
of the different classes, called nearest misses $M(t)$ (lines 5 and 6). It updates the
quality estimation $W_v$ for all attributes depending on their values for $R_i$, hits $H$,
and misses $M(t)$ (lines 7 and 8). The process is repeated $m$ times.

The update formula balances the contribution of hits and all the misses, and
averages the result of $m$ iterations:

$$W_v = W_v - \frac{1}{m}\,con(A_v, R_i, H) + \frac{1}{m}\sum_{\substack{t=1 \\ \tau_t \neq R_{i,\tau}}}^{c} \frac{p(\tau_t)\, con(A_v, R_i, M(t))}{1 - p(R_{i,\tau})} \qquad (4)$$

where $con(A_v, R_i, S)$ is the contribution of $k$ nearest instances from the set $S$
(hits or misses). In the simplest case it can be an average difference of attribute’s
Algorithm ReliefF

Input: for each training instance a vector of attribute values and the class value
Output: the vector W with the evaluation for each attribute

1. for v = 1 to a do W_v = 0
2. for i = 1 to m do begin
3.    randomly select an instance R_i
4.    find k nearest hits H
5.    for each class t ≠ R_{i,τ} do
6.       from class t find k nearest misses M(t)
7.    for v = 1 to a do
8.       update W_v according to Eq. (4)
9. end;

Fig. 1. Pseudo code of ReliefF algorithm.



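values for $k$ instances:

$$con(A_v, R_i, S) = \frac{1}{k}\sum_{j=1}^{k} \mathit{diff}(A_v, R_i, S_j).$$

Here $\mathit{diff}(A_v, I_t, I_u)$ denotes the difference between the values of the attribute $A_v$
for two instances $I_t$ and $I_u$. For nominal and numerical attributes, respectively,
it can be defined as:

$$\mathit{diff}(A_v, I_t, I_u) = \begin{cases} 0, & I_{t,v} = I_{u,v} \\ 1, & \text{otherwise} \end{cases} \qquad\qquad \mathit{diff}(A_v, I_t, I_u) = \frac{|I_{t,v} - I_{u,v}|}{\max_{l=1}^{n} I_{l,v} - \min_{l=1}^{n} I_{l,v}} \qquad (5)$$
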

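In this work we use an exponentially decreasing weighted contribution of instances
ranked by distance ($k = 70$, $\sigma = 20$, as recommended by [12]):

$$con(A_v, R_i, S) = \frac{\sum_{j=1}^{k} \mathit{diff}(A_v, R_i, S_j)\, e^{-\left(\frac{rank(R_i, S_j)}{\sigma}\right)^2}}{\sum_{l=1}^{k} e^{-\left(\frac{rank(R_i, S_l)}{\sigma}\right)^2}}$$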
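
A small Python sketch of the rank-weighted contribution follows. It assumes the weight of the neighbour with rank $r$ is $e^{-(r/\sigma)^2}$, normalized over the $k$ neighbours; the function name `rank_weighted_con` and the input convention (differences already sorted from nearest to farthest) are choices made for this illustration.

```python
import math


def rank_weighted_con(diffs, sigma=20.0):
    """Weighted average of attribute differences, nearest neighbour first.

    diffs[j] is the difference to the neighbour of rank j + 1; closer
    neighbours get exponentially larger weights exp(-(rank / sigma) ** 2).
    """
    weights = [math.exp(-((rank / sigma) ** 2))
               for rank in range(1, len(diffs) + 1)]
    return sum(d * w for d, w in zip(diffs, weights)) / sum(weights)


# The same two differences contribute more when the differing neighbour is nearer.
print(rank_weighted_con([1.0, 0.0], sigma=1.0))
print(rank_weighted_con([0.0, 1.0], sigma=1.0))
```

With a large $\sigma$ relative to $k$ the weights are nearly equal and the result approaches the plain average; a small $\sigma$ concentrates the contribution on the closest neighbours.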



In (4) the contribution of each class of misses is weighted with the prior
probability of that class, $p(\tau_t)$. Since the contributions of hits and misses in each step
should be in [0, 1] and also symmetric, the misses’ probabilities have to sum to
1. As the class of hits is missing in the sum, we have to divide each probability
weight by the factor $1 - p(R_{i,\tau})$.

Selection of k hits and k misses from each class instead of just one hit and miss 
and weighted update of misses is the basic difference to Relief. It ensures greater 
robustness of the algorithm concerning noise and favorable bias concerning multi- 
valued attributes and multi-class problems. 
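
The steps of the algorithm can be sketched in Python for nominal attributes. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it uses the simplest (unweighted average) form of the contribution, Hamming distance between instances, class priors estimated from the data, and hypothetical names (`relieff`, `con`, `distance`).

```python
import random
from collections import Counter


def diff(a, b):
    """Difference of two nominal attribute values: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def distance(x, y):
    """Hamming distance: sum of per-attribute differences."""
    return sum(diff(a, b) for a, b in zip(x, y))


def con(v, r, neighbours):
    """Average difference in attribute v between r and its neighbours."""
    return sum(diff(r[v], s[v]) for s in neighbours) / len(neighbours)


def relieff(X, y, m, k, seed=0):
    """Return one quality estimate per attribute after m sampled instances."""
    rng = random.Random(seed)
    n, a = len(X), len(X[0])
    prior = {cls: cnt / n for cls, cnt in Counter(y).items()}
    W = [0.0] * a
    for _ in range(m):
        i = rng.randrange(n)                     # randomly select an instance
        r, r_cls = X[i], y[i]
        by_class = {cls: [] for cls in prior}    # other instances, grouped by class
        for j in range(n):
            if j != i:
                by_class[y[j]].append(X[j])
        nearest = {cls: sorted(items, key=lambda s: distance(r, s))[:k]
                   for cls, items in by_class.items() if items}
        for v in range(a):
            if r_cls in nearest:                 # nearest hits lower the estimate
                W[v] -= con(v, r, nearest[r_cls]) / m
            for cls, misses in nearest.items():  # nearest misses raise it
                if cls != r_cls:
                    weight = prior[cls] / (1.0 - prior[r_cls])
                    W[v] += weight * con(v, r, misses) / m
    return W


# Toy data: attribute 0 copies the class, attribute 1 is unrelated to it.
y = [i % 2 for i in range(40)]
X = [(label, (i // 2) % 2) for i, label in enumerate(y)]
print(relieff(X, y, m=40, k=3))   # attribute 0 scores about 1, attribute 1 scores 0
```

In the toy data every nearest hit agrees with the sampled instance on both attributes, while every nearest miss differs only on attribute 0, so only the class-identifying attribute accumulates a positive score.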
Table 1. Characteristics of the problems.

name  c  class distribution          a   #inf  #rnd  n     p' distribution by (7)
C2u   2  0.5, 0.5                    9   4     5     1000  0.05 0.95
C2i   2  0.9, 0.1                    9   4     5     1000  0.31 0.69
C3u   3  0.33, 0.33, 0.33            11  6     5     1000  0.06 0.18 0.76
C3i   3  0.8, 0.15, 0.05             11  6     5     1000  0.33 0.30 0.37
C5u   5  0.2, 0.2, 0.2, 0.2, 0.2     15  10    5     1000  0.01 0.01 0.03 0.06 0.89
C5i   5  0.5, 0.3, 0.15, 0.04, 0.01  15  10    5     1000  0.16 0.17 0.22 0.12 0.33
C2xu  2  0.5, 0.5                    13  8     5     1000  0.05 0.95
C2xi  2  0.9, 0.1                    13  8     5     1000  0.31 0.69
C3xu  3  0.33, 0.33, 0.33            17  12    5     1000  0.06 0.18 0.76
C3xi  3  0.8, 0.15, 0.05             17  12    5     1000  0.33 0.30 0.37
C5xu  5  0.2, 0.2, 0.2, 0.2, 0.2     25  20    5     1000  0.01 0.01 0.03 0.06 0.89
C5xi  5  0.5, 0.3, 0.15, 0.04, 0.01  25  20    5     1000  0.16 0.17 0.22 0.12 0.33



3 Imbalanced Data Sets 



Misclassification costs are often closely related to an imbalanced distribution of
class values in the data set (rare classes usually being of higher interest). We first
test the ability of the described measures to detect attributes which identify minority
class values and, for now, we do not assume any knowledge of costs. For that
purpose we constructed three problems, C2, C3, and C5, with 2, 3, and 5 class
values (available labels are c1, c2, c3, c4, or c5). For each class value (2, 3, or
5 of them) we construct two binary attributes, A-c?-90 and A-c?-70 (with values 0 and
1). Each binary attribute identifies one class value in 90% or 70% of the cases
(e.g., the value of attribute A-c2-90 is 1 in 90% of the cases where the instance
is labeled with c2; if the label is different from c2, the attribute’s value is randomly
assigned). In each problem we also have 5 binary random attributes (R-50, R-60,
R-70, R-80, and R-90), with 50%, 60%, 70%, 80%, and 90% of 0 values.

To test detection of conditional dependencies we transformed C2, C3, and C5
by replacing each of the informative binary attributes with
two attributes whose XOR is the original attribute (e.g., A-c2-90 is replaced
with X1-c2-90 and X2-c2-90, where their values are assigned in such a way that
the parity bit of the two attributes equals the value of A-c2-90). We call the
transformed problems C2x, C3x, and C5x, respectively.

To observe how the distribution of class values influences the evaluation mea- 
sures we formed two versions of each problem, one with uniform distribution of 
class values (data sets with suffix ‘u’) and one with an imbalanced distribution of
class values (data sets with suffix ‘i’), so altogether 12 data sets. Distribution of
class values and characteristics of the problems are given in Table 1. 
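
The construction of these problems can be sketched in Python. The sketch assumes that "randomly assigned" means a fair coin flip and that one of the two XOR attributes is drawn at random while the other is set to match the parity; the function names `make_problem` and `xor_split` are hypothetical.

```python
import random


def make_problem(class_probs, n, seed=0):
    """Generate attribute names, rows of binary attribute values, and labels."""
    rng = random.Random(seed)
    classes = list(range(len(class_probs)))
    random_zero_pct = (50, 60, 70, 80, 90)
    names = [f"A-c{c + 1}-{pct}" for c in classes for pct in (90, 70)]
    names += [f"R-{pct}" for pct in random_zero_pct]
    rows, labels = [], []
    for _ in range(n):
        label = rng.choices(classes, weights=class_probs)[0]
        row = []
        for c in classes:
            for pct in (90, 70):
                if label == c:   # identifies its class in pct% of the cases
                    row.append(1 if rng.random() < pct / 100 else 0)
                else:            # assumed: fair coin for all other classes
                    row.append(rng.randint(0, 1))
        for pct in random_zero_pct:   # random attributes with pct% zeros
            row.append(0 if rng.random() < pct / 100 else 1)
        rows.append(row)
        labels.append(label)
    return names, rows, labels


def xor_split(value, rng):
    """Replace one binary value by two bits whose XOR equals it."""
    x1 = rng.randint(0, 1)
    return x1, x1 ^ value


names, rows, labels = make_problem([0.9, 0.1], n=1000, seed=1)
print(len(names), names[:4])
```

For two classes this yields 4 informative and 5 random attributes, matching the attribute counts listed for C2 in Table 1; applying `xor_split` to each informative attribute doubles their number, as in the C2x variants.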

We begin our analysis with two class problems. Note that for all measures 
higher score means better attribute, but the scores are not comparable between 
measures or across problems. 

The left-hand side of Table 2 gives evaluations for the problem where class
values are uniformly distributed (C2u problem). All the measures give expected
rankings, i.e., attributes identifying values in 90% of the cases have higher scores
than 70% attributes. All informative attributes were assigned higher scores than
Rmax, which is the highest score assigned to one of the five random attributes.
If its value is larger than the value of some informative attribute, that attribute
is indistinguishable from random attributes for the respective measure.

The right-hand side of Table 2 contains evaluations for the two-class, imbalanced
problem (C2i). As before, impurity-based measures rank 90% attributes higher
than 70% attributes, and they also rank higher the attributes identifying the more
probable class. ReliefF, on the contrary, ranks the minority class higher. The
reason for this, as well as for the high score of the random attribute (R-50), becomes
evident if we consider the space of attributes and its role in (4). The negative
update of nearest hits in these two cases is likely to be zero (nearest instances have
the same values of attributes), and so the positive update of nearest misses is
not canceled for random attributes and the attributes identifying the minority class.



Table 2. Feature evaluations for C2u and C2i.

            C2u, uniform                              C2i, imbalanced
measure     A-c1-90 A-c1-70 A-c2-90 A-c2-70 Rmax      A-c1-90 A-c1-70 A-c2-90 A-c2-70 Rmax
Gain ratio  0.171   0.022   0.193   0.027   0.001     0.078   0.007   0.031   0.017   0.002
DKM         0.110   0.015   0.122   0.018   0.001     0.045   0.007   0.039   0.020   0.003
MDL         0.149   0.018   0.164   0.022   -0.003    0.041   0.002   0.026   0.012   -0.002
ReliefF     0.156   0.033   0.130   0.029   0.008     0.185   0.137   0.301   0.183   0.141



Similar results for three-class problems are collected in Table 3. Due to space
constraints we omit results for A-c1-70 and A-c2-70, as they show a similar trend
to A-c3-70, but are always assigned higher scores. With uniform class distribution
(left-hand side of the table) all measures except DKM separate informative
from random attributes and rank 90% attributes higher than 70% attributes.
The values of DKM are completely uninformative (after the split the probability
of the majority class is around 0.5, giving the impression of high impurity). With
imbalanced class distribution (p(0)=0.8, p(1)=0.15, p(2)=0.05; right-hand side of
the table), all measures rank attributes identifying more frequent classes higher
than attributes identifying less frequent classes, rank 90% attributes higher than 70%
attributes, and do not distinguish between A-c3-70 and the random attribute with
maximal score. ReliefF improves its behavior compared to two-class problems,
because of more attributes (distances are larger and hits start to normalize the
excessive contributions of the misses) and because of its normalizing factor for
misses in (4). We get similar results and trends for 5-class problems, so we skip
the details.

In all problems where informative attributes are replaced with two XOR-ed
attributes (C2xi, C2xu, C3xi, C3xu, C5xi, C5xu) the impurity functions do
not differentiate between informative and random attributes, while ReliefF does,
except for 70% attributes and the best random attribute (R-50). As it is a
well-established fact that ReliefF can detect attributes with strong interactions and
Table 3. Feature evaluations for C3u and C3i.

            C3u, uniform                              C3i, imbalanced
measure     A-c1-90 A-c2-90 A-c3-90 A-c3-70 Rmax      A-c1-90 A-c2-90 A-c3-90 A-c3-70 Rmax
Gain ratio  0.138   0.118   0.121   0.029   0.002     0.223   0.066   0.028   0.002   0.002
DKM         -0.055  -0.053  -0.054  -0.038  -0.014    0.121   0.040   0.002   0.001   0.004
MDL         0.123   0.103   0.109   0.021   -0.001    0.149   0.057   0.019   -0.006  -0.000
ReliefF     0.108   0.093   0.086   0.027   0.006     0.262   0.191   0.127   0.094   0.096



impurity-based measures cannot, this is an expected result, but it shows that this
ability exists in the imbalanced data sets as well. We skip the details.

The attribute evaluation measures we described so far did not take cost 
information into account. Surely, if such information is available we want the
measures to take it into account and give higher scores to attributes identifying
classes whose misclassification cost is higher. We present such measures in the 
next section. 



4 Implanting Cost-Sensitivity 



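There are different techniques for incorporating cost information into learning.
The key idea is to use the expected cost of misclassification [1,13]. Following [8], we
define the expected cost of misclassifying an example that belongs to the $i$-th class
as

$$\varepsilon_i = \frac{1}{1 - p(\tau_i)} \sum_{j=1,\, j \neq i}^{c} p(\tau_j)\, C(i,j) \qquad (6)$$

and then change the probability estimates for class values:

$$p'(\tau_i) = \frac{p(\tau_i)\,\varepsilon_i}{\sum_{j=1}^{c} p(\tau_j)\,\varepsilon_j} \qquad (7)$$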



We use (7) in (1) and (2) to make Gain ratio and DKM cost-sensitive. In (1) the
conditional probabilities $p(\tau_k \mid a_{i,j})$ are also computed in the spirit of (7). We call
the respective measures GRatioC and DKMc. This adaptation has the same
effect as sampling the data proportionally to (7). MDL uses length of the code 
instead of probabilities, so we cannot use this approach, but we can sample the 
data according to (7) and run MDL (3) on the resulting data set. The resulting 
measure is referred to as MDLs. 
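
The cost-adjusted class probabilities can be sketched in Python. The sketch assumes that the expected cost of class $i$ is the prior-weighted average of $C(i,j)$ over the other classes, and that the adjusted probabilities are the priors reweighted by these costs and renormalized. The function names and the cost matrix are hypothetical; the matrix was chosen so that the results round to the values Table 1 lists for C2u and C2i, and the matrix actually used in the experiments may differ.

```python
def expected_costs(priors, cost):
    """Expected cost of misclassifying an example of each class.

    cost[i][j] is the cost of predicting class j for an example of class i.
    """
    c = len(priors)
    return [sum(priors[j] * cost[i][j] for j in range(c) if j != i)
            / (1.0 - priors[i])
            for i in range(c)]


def cost_adjusted_priors(priors, cost):
    """Reweight the priors by the expected costs and renormalize."""
    eps = expected_costs(priors, cost)
    total = sum(p * e for p, e in zip(priors, eps))
    return [p * e / total for p, e in zip(priors, eps)]


# Hypothetical two-class matrix: errors on the second class cost 20 times more.
cost = [[0, 1],
        [20, 0]]
print([round(p, 2) for p in cost_adjusted_priors([0.5, 0.5], cost)])
print([round(p, 2) for p in cost_adjusted_priors([0.9, 0.1], cost)])
```

Sampling the data proportionally to these adjusted probabilities shifts weight towards the expensive class, which is how the sampling-based variant for MDL makes use of the costs.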

For two class problems [8] have adapted Relief^ to use cost by changing its 
update formula^: 



W, = W,- diff(A„ g)/m + - diff(4„ M))/m . (8) 

Ej=lP(T:/)Sy 

^ Relief uses one nearest hit H and one nearest miss M, so we use diff instead of con. 
^ This formula was typeset incorrectly in [8] (confirmed by M. Kukar, personal com- 
munication). Eq. (8) is the correct version which was actually implemented. 
This adaptation (called ReliefK in results below) is tailored for two-class problems.
As we were not satisfied with its performance on multi-class problems, we
tried different multi-class extensions and used $p'(\tau_i)$ instead of $p(\tau_i)$ in (4). We
denote this extension with ReliefFp'. If we use just the information from the cost
matrix and do not take prior probabilities into account, then, similarly to (6) and (7),
we compute the average cost of misclassifying an example that belongs to the $i$-th
class as

$$\alpha_i = \frac{1}{c-1} \sum_{j=1,\, j \neq i}^{c} C(i,j). \qquad (9)$$

The prior probability of the class value becomes

$$\bar{p}(\tau_i) = \frac{\alpha_i}{\sum_{j=1}^{c} \alpha_j}. \qquad (10)$$

We use $\bar{p}(\tau_i)$ instead of $p(\tau_i)$ in (4) and call this version ReliefFp. For two-class
problems ReliefF, ReliefFp', and ReliefFp are identical.



Another idea for using the cost information stems from the generalized form
of ReliefF [12]:

$$W_v = \sum_{I_t, I_u \in I} similarity(\tau, I_t, I_u) \cdot similarity(A_v, I_t, I_u),$$

where $I_t$ and $I_u$ are appropriate samples drawn from the instance population
$I$. For attribute similarity ReliefF uses the negative diff function (5) and for class
similarity it uses

$$similarity(\tau, I_t, I_u) = \begin{cases} 1, & I_{t,\tau} = I_{u,\tau} \\ -1, & I_{t,\tau} \neq I_{u,\tau} \end{cases} \qquad (11)$$




which together give exactly the updates for hits and misses in the original
Relief. The obvious place to use cost information is therefore (11), which affects
the update formula (4). We used cost information in the form of expected and
average cost. Using the expected cost, the contribution of hits is weighted by the
expected cost $\varepsilon_{R_{i,\tau}}$ of misclassifying the class of $R_i$, while a miss from a
different class contributes with its actual cost, so (4) changes to

$$W_v = W_v - \varepsilon_{R_{i,\tau}} \, \mathrm{con}(A_v, R_i, H)/m + \sum_{t=1,\, \tau_t \neq R_{i,\tau}}^{c} \frac{p(\tau_t)\, C(R_{i,\tau}, \tau_t)}{1 - p(R_{i,\tau})} \, \mathrm{con}(A_v, R_i, M(t))/m . \qquad (12)$$



We call this measure ReliefFeC. While its updates are symmetric for hits and
misses, note that they are not normalized to [0,1], so the scores of the attributes
are not necessarily normalized to [-1,1]. If we use just the cost information (no
priors), we can use the average cost of misclassification (ReliefFaC variant):

$$W_v = W_v - \alpha_{R_{i,\tau}} \, \mathrm{con}(A_v, R_i, H)/m + \sum_{t=1,\, \tau_t \neq R_{i,\tau}}^{c} \frac{C(R_{i,\tau}, \tau_t)}{c-1} \, \mathrm{con}(A_v, R_i, M(t))/m . \qquad (13)$$

For two-class problems ReliefFeC and ReliefFaC are identical.
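
The ReliefFaC update can be sketched for a single attribute and a single sampled instance as follows. This is only an illustration under stated assumptions: the contributions con(...) of the nearest hits and of the nearest misses from each other class are assumed to be precomputed, and all names (`relieff_ac_update`, `con_hit`, `con_miss`, `cost_row`) are hypothetical.

```python
def relieff_ac_update(w, con_hit, con_miss, cost_row, own_class, m):
    """One average-cost update of an attribute weight w.

    con_hit   -- contribution of the nearest hits for this attribute
    con_miss  -- dict: other class t -> contribution of its nearest misses
    cost_row  -- cost_row[t] = cost of predicting t for the instance's class
    own_class -- class index of the sampled instance
    m         -- number of sampled instances (normalization)
    """
    c = len(cost_row)
    others = [t for t in range(c) if t != own_class]
    # Average misclassification cost of the instance's class weights the hits.
    alpha = sum(cost_row[t] for t in others) / (c - 1)
    w -= alpha * con_hit / m
    # Each miss class contributes with its actual cost, averaged over c - 1.
    for t in others:
        w += cost_row[t] / (c - 1) * con_miss[t] / m
    return w


# Hypothetical numbers, for illustration only.
example_w = relieff_ac_update(0.0, con_hit=0.5, con_miss={1: 1.0, 2: 0.5},
                              cost_row=[0, 2, 4], own_class=0, m=10)
```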

We assumed that C(i,i) = 0, i.e., that predicting the correct class incurs no
cost. If we use a benefit matrix instead of a cost matrix, this is usually not the
case, and we suggest using the actual C(i,i) instead of the expected and average
cost as the normalizing factor for hits in (12) and (13).

Alternatively, instead of using costs directly, we can change the sampling to
reflect the cost matrix as in [1,10]. While this approach may not reflect all the
details of the cost matrix, it may still work well in practice. We made the
sampling of random instances of class j in ReliefF (line 3 in Figure 1)
proportional to (7). The resulting measure is called ReliefFs. In the next section
we test how these measures exploit cost information.



5 Using Cost Information 

Following the arguments of [4] and [9], not all cost matrices are sensible and
realistic. We try to test our measures with realistic cost matrices, e.g.,
detecting an exception for C2, a progressive health risk for C3, and a financial
loss for C5 problems:
The right-most column of Table 1 presents the probability distributions by (7)
computed from the given class distributions and cost matrices.

In Table 4 we give results for two and three class problems with imbalanced
distribution (C2i and C3i). A uniform distribution is unrealistic together with
cost matrix information, so we skip these results. A-70min denotes the 70%
attribute with the minimal score and Rmax the random attribute with the maximal
score.

For the two-class problem C2i (left-hand side of Table 4), MDLs and all variants
of ReliefF reflect cost-sensitivity, i.e., they evaluate A-c2-90 as better than
A-c1-90 (c2 has a higher cost assigned, so an attribute identifying it is more
useful). These measures also separate the 90% attributes from the random
attributes. Only MDLs separates the 70% attribute with minimal score from the
random attributes for two-class problems, while none of the measures can do that
for 3 and 5 class problems, which means that random attributes are more difficult
to detect in the cost-sensitive context. GRatioC and DKMc are also cost-sensitive
but fail to separate A-c1-90 from the random attributes.

For the three-class problem C3i (right-hand side of Table 4), MDLs, ReliefFp,
ReliefFp' and GRatioC are the most cost-sensitive (which can be seen by comparing
the results with Table 3), followed by ReliefFeC and ReliefFaC. ReliefFs and
ReliefK are cost-sensitive to a lesser extent. DKMc once again fails completely
for multi-class problems as the changed probability distribution moved towards
the uniform distribution.
Table 4. Cost-sensitive feature evaluations for C2i and C3i.

                          C2i                                     C3i
measure     A-c1-90  A-c2-90  A-70min     Rmax  A-c1-90  A-c2-90  A-c3-90  A-70min     Rmax
GRatioC      -0.029    0.114    0.000    0.000    0.177    0.132    0.169    0.009    0.020
DKMc         -0.007    0.083    0.000    0.000   -0.035   -0.032   -0.032   -0.029    0.000
MDLs          0.095    0.189    0.021   -0.000    0.185    0.106    0.144    0.003    0.007
ReliefK       0.078    0.107    0.000    0.034    0.125    0.034    0.044    0.008    0.018
ReliefFp'     0.185    0.306    0.137    0.141    0.286    0.200    0.195    0.136    0.133
ReliefFp      0.185    0.306    0.137    0.141    0.306    0.208    0.252    0.171    0.166
ReliefFeC     0.236    0.335   -0.125   -0.072    0.405    0.029    0.092   -0.132   -0.020
ReliefFaC     0.236    0.335   -0.125   -0.072    0.352    0.105    0.182   -0.001    0.000
ReliefFs      0.083    0.123   -0.045   -0.025    0.167    0.025    0.050   -0.049   -0.006



These findings are even more pronounced for the five-class problem C5i, where
only ReliefFp, ReliefFp', MDLs and GRatioC can separate the 90% attributes from
the random ones. ReliefFeC and ReliefFaC use (6) and (9) to normalize their hits,
so they are less stable when large differences between entries in the cost matrix
are not reflected by a sufficiently large number of instances. ReliefFs also
suffers from an insufficient number of instances, while ReliefK is not properly
normalized for multi-class problems.

In problems with XOR-ed attributes, ReliefF-based measures are cost-sensitive
and can differentiate between informative and random attributes, while
impurity-based measures cannot.

6 Conclusions 

We have investigated the performance of common attribute evaluation measures
in problems where the class distribution is imbalanced and in problems with
unequal misclassification costs. For that purpose we constructed several data
sets and adapted existing measures. Impurity-based measures were adapted by
including expected misclassification costs in the class probabilities or through
sampling. Adaptations of ReliefF stemmed from the expected misclassification
cost, the average misclassification cost, the general form of ReliefF, and
cost-stratified sampling.

Imbalanced data sets cause no problems for Gain ratio, MDL and ReliefF,
while DKM works only for two-class problems. Only ReliefF detects highly
dependent attributes.

In problems with unequal misclassification costs, only MDLs and the two variants
of ReliefF that use the probability estimates (7) and (10) in the update formula
(4) reliably exploit information from the cost matrix. The cost-sensitive
adaptation of Gain ratio fails to detect all important attributes in the two-class
problem, while DKM is useless for multi-class problems. ReliefF variants retain
their ability to detect highly dependent attributes.

While feature evaluation measures need not be cost-sensitive for decision
tree building, in further work we want to test this hypothesis with the presented
measures. We will also investigate feature selection and weighting in the
cost-sensitive context.
Abstract. In this paper we propose a population based optimization
method that uses the estimation of probability distributions. To repre- 
sent an approximate factorization of the probability, the algorithm em- 
ploys a junction graph constructed from an independence graph. We 
show that the algorithm extends the representation capabilities of previ- 
ous algorithms that use factorizations. A number of functions are used to 
evaluate the performance of our proposal. The results of the experiments 
show that the algorithm is able to optimize the functions, outperforming 
other evolutionary algorithms that use factorizations. 

Keywords: Genetic algorithms, EDA, FDA, evolutionary optimization, 
estimation of distributions. 



1 Introduction 

In the application of Genetic Algorithms (GAs) [4] to a wide class of
optimization problems, the identification and mixing of building blocks is
essential. It was noticed early that the Simple GA (SGA) is in general unable to
accomplish these two tasks for difficult problems (e.g. deceptive problems).
Perturbation techniques, linkage learners and model building algorithms are among
the alternatives proposed to improve GAs. They try to identify the relevant
interactions among the variables of the problem, and to use them in an efficient
way to search for solutions.

Estimation of Distribution Algorithms (EDAs) [10] are evolutionary algorithms
that do not use the crossover and mutation operators. In each generation they
construct a probabilistic model of the selected solutions. The probabilistic
model must be able to capture a number of relevant relationships in the form
of statistical dependencies among the variables. The dependencies are then used
to generate solutions during a sampling step. It is expected that the generated
solutions share a number of characteristics with the selected ones. In this way
the search is led to promising areas of the search space. The interested reader
is referred to [5] for a good survey that covers the theory and a wide spectrum
of EDA applications.

One efficient way of estimating a probability distribution is by means of
factorizations. A probability distribution is factorized when it can be computed
as the product of functions defined on subsets of the variables. A subclass of
EDAs includes the algorithms that use factorizations of the probability
distribution. In this paper we call this subclass Factorized Distribution
Algorithms (FDAs)^ [9].

FDAs have outperformed other evolutionary algorithms in the optimization 
of complex additive functions, and deceptive problems with overlapping variables 
[9]. However, a shortcoming of FDAs is that the probabilistic model they are 
based on is constrained to represent a limited number of interactions. In this 
paper we investigate the issue of extending the representation capabilities of 
FDAs. To this end we introduce the Markov Network FDA (MN-FDA), a new
type of FDA based on an undirected graphical model, and able to represent the
so-called "invalid" factorizations [9].

The paper is organized as follows. In Section 2 we discuss the problem of
obtaining a factorization of the probability. Section 3 presents the main steps
for learning an approximate factorization from data. Section 4 explains the way
the sampling step has been implemented. We introduce the MN-FDA in Section 5.
Section 6 presents the functions used in our experiments and discusses the
numerical results of the simulations. Section 7 analyzes the MN-FDA in the
context of recent related research on evolutionary computation and presents the
conclusions of our paper.

2 Factorization of a Probability 

The central problem of FDAs is how to efficiently estimate a factorization of 
the joint probability of the selected individuals. To compute a factorization the 
theory of graphical models is usually employed. The following definitions will 
help in the explanation of our proposal. 

Let $X = (X_1, X_2, \dots, X_n)$ represent a vector of integer random variables,
where n is the number of variables of the problem, $x = (x_1, x_2, \dots, x_n)$ is an
assignment to the variables, and p(x) is a joint probability distribution to be
modeled. Each variable of the problem is associated with one vertex in an
undirected graph G = (V, E). The graph G is a conditional independence graph with
respect to p(x) if there is no edge between two vertices whenever the pair of
variables is independent given all the remaining variables.

Definition 1. Given a graph G, a clique in G is a fully connected subset of V.
We reserve the letter C to refer to a clique. The collection of all cliques in G
is denoted as $\mathcal{C}$. C is maximal when it is not contained in any other clique. C is
the maximum clique of the graph if it is the clique in $\mathcal{C}$ with the highest number
of vertices.

Definition 2. A junction graph (JG) of the independence graph G is a graph
where each node corresponds to a maximal clique of G, and there exists an edge
between two nodes if their corresponding cliques overlap. A labeled junction graph
is a JG that has an associated ordering of the nodes with a distinguished node
called the root, and satisfies that a node belongs to the graph if at least one of
the variables in its clique is not contained in the previous nodes in the ordering.

^ In the literature the term FDA is frequently used to name a particular type of
Factorized Distribution Algorithm. Our definition covers it, as well as other
algorithms that use factorizations.



Definition 3. A junction tree (JT) is a single connected junction graph. It 
satisfies that if the variable Xk is a member of the junction tree nodes i and j , 
then Xk is a member of every node on the path between i and j . This property 
is called the running intersection property. 

If the independence graph G is chordal, an exact factorization of the
probability, based on the cliques of the graph, exists. The factorization can be
represented using a JT. If G is not chordal, a chordal super-graph of G can be
found by adding edges to G in a process called triangulation. The problem is that
we cannot guarantee that the maximum clique of the super-graph will have a
size that makes the calculation of the marginal probabilities feasible. The
problem of finding a triangulation with a maximum clique of minimum size is
NP-complete.

Our goal is to find an approximate factorization that contains as many de- 
pendencies as possible, but without adding new edges to the graph. An exact 
factorization would comprise all the dependencies represented in the indepen- 
dence graph. We will assume that approximate factorizations of the probability 
are more precise as they include more of the dependencies represented in the 
independence graph. The approximate factorization will be represented using a 
labeled JG. The algorithm for learning the probabilistic model has five main 
steps. 



Algorithm 1: Model learning 

1. Learn an independence graph G from the data (the selected set of solutions). 

2. If necessary, refine the graph. 

3. Find the set L of all the maximal cliques of G. 

4. Construct a labeled JG from L.

5. Find the marginal probabilities for the cliques in the JG. 



3 Learning an Approximate Factorization 

In this section we consider in detail the different steps for learning an approxi- 
mate factorization from data. 

3.1 Learning of an Independence Graph 

The construction of an independence graph from the data can be accomplished 
by means of independence tests. To determine if an edge belongs to the graph, 
it is enough to make an independence test on each pair of variables given the 






Roberto Santana 



rest. Nevertheless, from an algorithmic point of view it is important to reduce the 
order of the independence tests. Thus, we have adopted the methodology followed 
previously by Spirtes [12]. The idea is to start from a complete undirected graph, 
and then try to remove edges by testing for conditional independence between 
the linked nodes, but using conditioning sets as small as possible. 

To evaluate the independence tests we use the chi-square independence test.
If two variables $X_i$ and $X_j$ are dependent with a specified level of significance
$\alpha$, they are joined by an edge; $\alpha$ is a parameter of the algorithm. In the general
case we can assume that each edge $i \sim j$ in the initial independence graph
is weighted with a value $w(i,j)$ expressing the strength of the pairwise
interaction between the variables. This information might be available from prior
information, or from the statistical tests conducted on the data (the value of
the chi-square test). When such information is not available we assume that all
the values of the dependencies are equal to a parameter $w'$ (i.e. $w(i,j) = w', \forall i \sim j \in E$).
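
The simplest (zero-order) case of such a test, for two binary variables, can be sketched as follows. This is only an illustration: the conditional tests with non-empty conditioning sets used by the algorithm are not covered, the function names are hypothetical, and the p-value formula relies on the fact that a 2x2 table has one degree of freedom.

```python
import math


def chi_square_binary(xs, ys):
    """Chi-square statistic and p-value for two binary samples (1 d.o.f.)."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = row[a] * col[b] / n
            if expected > 0:
                stat += (counts[a][b] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def is_dependent(xs, ys, alpha):
    """Join the two variables by an edge when independence is rejected."""
    _, p_value = chi_square_binary(xs, ys)
    return p_value < alpha
```

The returned statistic could also serve as the edge weight w(i,j) when no prior information about the interactions is available.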

3.2 Refinement of the Graph 

When the independence graph is very dense, we can expect the size of the
cliques to increase. One way to address this problem is to make the graph sparser
in a step previous to the calculation of the cliques. This can be done by
allowing a maximum number r of incident edges for each vertex. If a vertex has
more than r incident edges, those with the lowest weights are removed. In this
way the size of the maximum clique will always be smaller than or equal to r. Our
refinement algorithm avoids introducing a bias in the way the edges are removed.
However, it has a main drawback: it could be the case that more than r variables
depend on a single one, but the maximum clique in which this variable is included
is smaller than r. In this case, the procedure that eliminates the edges would
remove dependencies from the graph without a real need to do so. We have not
found a better practical solution to this problem.
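
A minimal sketch of this refinement step is given below. The representation of the graph as a dictionary from edges to weights, the order in which the vertices are visited, and the name `refine_graph` are assumptions made for this illustration only.

```python
def refine_graph(edges, r):
    """Keep at most r incident edges per vertex, dropping the lightest ones.

    edges -- dict mapping frozenset({i, j}) to the edge weight w(i, j)
    r     -- maximum number of incident edges allowed for each vertex
    """
    kept = dict(edges)
    vertices = sorted({v for edge in edges for v in edge})
    for v in vertices:
        incident = [edge for edge in kept if v in edge]
        excess = len(incident) - r
        if excess > 0:
            # Remove the edges with the lowest weights first.
            incident.sort(key=lambda edge: kept[edge])
            for edge in incident[:excess]:
                del kept[edge]
    return kept
```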

3.3 Maximal Cliques of the Graph 

To find all the maximal cliques of the graph, the Bron and Kerbosch algorithm
[1] is used. This algorithm uses a branch and bound technique to cut off branches
that cannot lead to cliques. Once all the cliques have been found they are stored
in a list L, and their weights are calculated from the information about
dependencies. The weight of any subgraph G' of G is calculated as
$W(G') = \sum_{i \sim j \in G'} w(i,j)$. In this way the weights of the maximal
cliques $W(C_i)$ are calculated.
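
For illustration, the basic recursive form of the Bron and Kerbosch procedure (without pivoting, so not necessarily the exact variant used in the experiments) and the clique weight can be sketched as follows. The adjacency representation and the function names are assumptions.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate all maximal cliques; adj maps each vertex to its neighbor set."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_weight(clique, weights):
    """Sum of the edge weights w(i, j) over all pairs of vertices in the clique."""
    return sum(weights[frozenset(pair)]
               for pair in combinations(sorted(clique), 2))
```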

3.4 Construction of the Labeled JG 

Algorithm 2 receives the list of cliques L with their weight, and outputs a list L' 
of the cliques in the labeled JG. The first clique in L' is the root, and the labels 
of cliques in the labeled JG correspond to their position in the list. Each clique 
in the labeled JG is a subset of a clique in L. 
Algorithm 2: Algorithm for learning a JG

1. Order the cliques in L in decreasing order according to their weights
2. Add element L(1) to list L'
3. Remove element L(1) from L
4. While L is not empty
5.   Find the first element C in L such that
     $C \cap (L'(1) \cup L'(2) \cup \dots \cup L'(NCliques)) \neq C$, and the number of
     variables in $C \cap (L'(1) \cup L'(2) \cup \dots \cup L'(NCliques))$ is maximized
6.   If $C = \emptyset$, remove all the elements in L
7.   else insert C in L'



We focus now on step 5 of Algorithm 2. The condition of maximizing the
number of variables in $C \cap (L'(1) \cup L'(2) \cup \dots \cup L'(NCliques))$ states that the
clique C in L that has the highest number of overlapping variables with all the
variables already in L' will be added to L'. The number of overlapping variables
has to be less than the size of the clique, a constraint meaning that at least one
of the variables in C has not appeared before. If there exist several cliques with
the maximum number of overlapping variables, the one that appears first in L is
added to L'. On the other hand, if the maximum number of overlapping variables
is zero, then the JG has more than one connected component. In this case we have
a set of junction graphs; however, we have preferred to abuse the notation and
call it a JG, whether it has one or more connected components. Finally, the
addition of cliques stops when all the variables are already in the JG.
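
The construction of the labeled JG can be sketched as follows. This is an illustrative reading of Algorithm 2, not the original implementation: each node of the labeled JG is stored as a pair (new variables, overlapping variables), and the function name `build_labeled_jg` is hypothetical.

```python
def build_labeled_jg(cliques, weights):
    """Greedy construction of a labeled JG from maximal cliques and weights.

    Returns a list of (new_vars, overlap_vars) pairs; the first is the root.
    """
    # The sort is stable, so cliques with equal weights keep their order in L.
    order = sorted(range(len(cliques)), key=lambda i: -weights[i])
    remaining = [set(cliques[i]) for i in order]
    root = remaining.pop(0)
    labeled = [(set(root), set())]
    covered = set(root)
    while remaining:
        best, best_overlap = None, -1
        for clique in remaining:
            overlap = len(clique & covered)
            # The clique must contribute at least one new variable.
            if overlap < len(clique) and overlap > best_overlap:
                best, best_overlap = clique, overlap
        if best is None:
            break  # all remaining cliques are already covered
        remaining.remove(best)
        labeled.append((best - covered, best & covered))
        covered |= best
    return labeled
```

Storing the overlapping variables next to the new ones makes the later sampling step direct: the new variables of a node are sampled conditionally on its overlapping variables.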

Marginal probabilities are found by calculating the number of counts associ- 
ated to each configuration, and normalizing. In the implementation, the learned 
model’s parameters can be changed by adding a perturbation in the form of 
probabilistic priors. 
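
A minimal sketch of this counting step for binary variables is shown below. The additive `prior` argument is one simple way to realize such a perturbation and is an assumption of this sketch, as is the function name.

```python
from collections import Counter
from itertools import product


def clique_marginal(population, clique_vars, prior=0.0):
    """Marginal distribution of the binary variables in clique_vars.

    population  -- list of individuals (sequences of 0/1 values)
    clique_vars -- tuple of variable indices forming the clique
    prior       -- pseudo-count added to every configuration
    """
    counts = Counter(tuple(ind[v] for v in clique_vars) for ind in population)
    configs = list(product((0, 1), repeat=len(clique_vars)))
    total = len(population) + prior * len(configs)
    return {cfg: (counts[cfg] + prior) / total for cfg in configs}
```

With `prior > 0` no configuration receives probability zero, which keeps unseen configurations reachable during sampling.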



3.5 Description of an Example of the Learning Algorithm 

We introduce an example of the application of algorithm 1. The information 
about the dependencies among the 12 variables of a given problem is represented 
by the independence graph shown in figure 1 (left). Let us suppose that the 
maximum number of incident edges allowed is 6. In the refinement step only 
edges incident to the vertex have to be removed. If information were available
about the dependencies of each link, the two edges with the weakest dependencies
would be removed. In the present example we assume that all the dependencies
are equally strong, and two arbitrary edges {x\ ^ and x^ ^ sio) are removed. 
The refined graph is shown in figure 1 (right) . 

In the next step all the maximal cliques of the graph are found. There are 
9 maximal cliques, all of order 3. Also in this case the cliques have the same 
weight, therefore we arbitrarily select the clique {x\,X 2 ,x^) as the root. 
Fig. 1. Original and refined independence graphs.





Fig. 2. Ordered junction graph and junction tree. 



Construction of the ordered JG: In the first step the clique with maximum over- 
lapping with all the variables already in the ordered JG is {x2,X3,X5). In the 
next step either of the cliques {x2,Xi,X5) or {x3,X5,xq) can be incorporated. 
Figure 2 (left) shows the final JG. In the cliques shown in the figure, each new
variable incorporated into the graph is represented to the left of the bar. Only
eight of the cliques are included in the ordered JG; the clique $(x_5, x_7, x_8)$ is
missing. Its absence is explained by the fact that the algorithm for finding the
ordered JG cannot guarantee that all the dependencies will be captured. On the
other hand, the factorization represented by the labeled JG is invalid because
there exists a cycle comprising different cliques.

Construction of the JT: Figure 2 (right) shows the JT obtained using the
algorithm. Notice that the JT can represent fewer dependencies than the JG. As
the JT prohibits the existence of cycles, the clique $(x_{10}, x_7, x_9)$ cannot be fully
represented.
4 Sampling of the Approximate Factorization

Points are sampled from the labeled JG following the order determined by the 
labels. The variables corresponding to the first clique in the JG are instantiated 
sampling from the marginal probabilities. For the rest of cliques, each subset of 
variables that has not been instantiated is sampled conditionally on the vari- 
ables already instantiated that belong to the clique. The process is very similar 
to Probabilistic Logic Sampling (PLS) [3], when it is used in junction trees. 
There exists however an important difference. The definition of JT discards the 
existence of cycles. A labeled JG can contain cycles, and this fact allows the 
representation of more interactions, but it does not essentially change the per- 
formance of the sampling algorithm. The reason is that in every step of the JG 
sampling algorithm, the conditioning and conditioned subsets of variables belong 
to the clique whose variables are being sampled. 
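
The sampling procedure can be sketched as follows. The node representation (a dictionary with the clique variables, the variables that are new at this node, and the clique marginal table) is an assumption of this sketch, as is the uniform fallback used when the conditioning configuration has zero probability.

```python
import random


def sample_point(nodes, n_vars, rng):
    """Sample one point from a labeled JG, following the order of the labels.

    Each node is a dict with keys:
      "vars"  -- tuple of all variable indices of the clique
      "new"   -- tuple of the variables instantiated at this node
      "table" -- dict: configuration over "vars" -> marginal probability
    """
    x = [None] * n_vars
    for node in nodes:
        variables = node["vars"]
        fixed = [v for v in variables if v not in node["new"]]
        # Keep the configurations that agree with the instantiated variables.
        consistent = [cfg for cfg in node["table"]
                      if all(cfg[variables.index(v)] == x[v] for v in fixed)]
        weights = [node["table"][cfg] for cfg in consistent]
        if sum(weights) == 0:
            weights = [1.0] * len(consistent)  # fallback: uniform choice
        chosen = rng.choices(consistent, weights=weights, k=1)[0]
        for v in node["new"]:
            x[v] = chosen[variables.index(v)]
    return x
```

Choosing among the consistent configurations with weights equal to their marginal probabilities is equivalent to sampling the new variables from the conditional distribution given the overlapping ones.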



5 MN-FDA 

Our algorithm is called Markov Network FDA (MN-FDA); its pseudo-code is
presented in Algorithm 3. The main difference between it and previous FDAs
based on undirected models is that it uses a labeled JG as its probabilistic
model, while previous FDAs based on undirected graphical models [9,11] represent
the factorizations using a JT.



Algorithm 3: MN-FDA

1. Set $t \leftarrow 0$. Generate $N \gg 0$ points randomly.
2. do {
3.   Select a set S of k < N points according to a selection method.
4.   Learn a labeled JG from the data.
5.   Calculate the marginal probabilities for all the cliques in the JG.
6.   Generate the new population by sampling from the JG.
7.   $t \leftarrow t + 1$
8. } until termination criteria are met



In all the experiments presented in this paper the algorithm used to learn
the independence graph only considered independence tests up to third order.
The level of significance $\alpha$ was set to 0.75. This choice was motivated by the
need to capture as many dependencies as possible. Even if some of the
dependencies found might be false, this is preferable to missing some of the real
dependencies. The number of allowed neighbors for the refinement algorithm
was set to 8.
5.1 Computational Cost of the MN-FDA

The number of operations needed to make the independence tests is upper
bounded by O(Nn^). The worst-case complexity of the refinement algorithm is
bounded by O(n? log(n)). It has been calculated considering the case when, after
the independence tests, the graph remains complete. The time complexity of
the Bron and Kerbosch algorithm is not calculated in the original work [1].
However, from comparisons with other algorithms for which bounds have been
calculated, the worst-case complexity of the algorithm can be estimated as O(μ^),
where μ is the number of maximal cliques. When there are at most k edges for
each variable and k << n, a bound for the number of cliques is given by
kn, and the complexity of the Bron and Kerbosch algorithm can be roughly
estimated as O(n?). The complexity of learning the parameters depends on the size
of the population N, the number of cliques μ, and their size. The order of this
step is O(Nμ) ≈ O(Nn^). The total complexity of the MN-FDA is O(Nn^).



6 Experiments 

In our experiments we compare the behavior of the MN-FDA with the following
FDAs. The FDA* [9] uses a fixed model of interactions; only the parameters of
the cliques are learned in each generation. The Univariate Marginal Distribution
Algorithm (UMDA) [6] uses a model that assumes all the variables are
independent; in every step, the algorithm learns the parameters of the
univariate probabilities. The Tree-FDA [11] uses a probability model where each
variable is conditioned on at most one parent. The Learning FDA (LFDA) [8] is
an FDA that uses a Bayesian probabilistic model.

First, a number of functions commonly used to evaluate evolutionary al- 
gorithms are presented. A practical problem used in the experiments is also 
described. All the problems used in the experiments are defined on binary vari- 
ables. The numerical results and the analysis of the experiments are presented 
afterward. 



6.1 Functions Used in the Experiments 

Deceptive functions were introduced by Goldberg to show the deceptive nature
of the GAs' behavior, and to address the problems caused by convergence
to local optima of the function. The following 4 elementary deceptive functions
of k variables are used to define some of the additive functions used in our
experiments. They are defined in terms of the unitation value $u(x) = \sum_{i=1}^{k} x_i$.
Each entry of the table shows the evaluation of the corresponding deceptive
function when the unitation value of the subset of variables it is evaluated on
is u.
u          0     1     2     3     4     5
f^3_dec    0.9   0.8   0     1
f^4_dec    3     2     1     0     4
IsoT1      m     0     0     0     0     0
IsoT2      m     0     0     0     0     m-1



$$Onemax(x) = \sum_{i=1}^{n} x_i \qquad (1)$$

$$f_{3deceptive}(x) = \sum_{i=1}^{n/3} f^3_{dec}(x_{3i-2}, x_{3i-1}, x_{3i}) \qquad (2)$$

$$Deceptive4(x) = \sum_{i=1}^{n/4} f^4_{dec}(x_{4i-3}, x_{4i-2}, x_{4i-1}, x_{4i}) \qquad (3)$$



Fisop{n,m,k,x) = ( 4 ) 

+ fc • (m + 1)((1 - Xi) • • • (1 - Xm)Xm+l ■ ■ ■ Xn) 
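
As an illustration, the additive functions (1) to (3) can be written directly in terms of the unitation value of consecutive, non-overlapping blocks of variables. The lookup tables below copy the rows of the table of elementary deceptive functions; the helper name `additive` is hypothetical.

```python
# Elementary deceptive functions, indexed by the unitation value u.
F3_DEC = {0: 0.9, 1: 0.8, 2: 0.0, 3: 1.0}
F4_DEC = {0: 3, 1: 2, 2: 1, 3: 0, 4: 4}


def onemax(x):
    """Number of ones in the binary vector x, as in (1)."""
    return sum(x)


def additive(x, k, table):
    """Sum of an elementary function over consecutive blocks of k variables."""
    if len(x) % k != 0:
        raise ValueError("the number of variables must be a multiple of k")
    return sum(table[sum(x[i:i + k])] for i in range(0, len(x), k))


def f3deceptive(x):
    return additive(x, 3, F3_DEC)


def deceptive4(x):
    return additive(x, 4, F4_DEC)
```

In both deceptive functions the all-ones vector is the global optimum, while the all-zeros vector is a strong local optimum.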

When analyzing interactions between variables it is important to consider
interactions that do not depend on the linear codification of solutions. To this
end we considered the function $F_{IsoTorus}$ (5), where $x_{left}$, etc., are defined
as the appropriate neighbors, wrapping around.



Fj soTorusi^x') — , Xi , X2 , Xi+TTj) 

E ti 

IsoT2{Xup : Xleft: Xi^ Xright: Xdown') (b) 

The function BigJump (6) was introduced in [7]. A valley has to be crossed in
order to reach the global optimum of this function. The larger the parameter m,
the wider the valley; k can be increased to give more weight to the maximum.

$$BigJump(n,m,k,x) = \begin{cases} u & \text{for } 0 \le u \le n-m \\ 0 & \text{for } n-m < u < n \\ k \cdot n & \text{for } u = n \end{cases} \qquad (6)$$
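
A minimal sketch of this function is given below. It assumes the reading of (6) in which the valley covers the unitation values strictly between n - m and n and the global optimum is the all-ones vector; the function name is hypothetical.

```python
def big_jump(x, m, k):
    """BigJump value of the binary vector x with valley width m and weight k."""
    n = len(x)
    u = sum(x)  # unitation value
    if u == n:
        return k * n  # global optimum
    if u <= n - m:
        return u      # slope leading towards the valley
    return 0          # inside the valley
```

For the instance BigJump(30, 3, 1), vectors with 28 or 29 ones fall into the valley and evaluate to 0.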

The generalized Ising model is described by the energy functional (Hamiltonian)
(7), where L is the set of sites called a lattice. Each spin variable $\sigma_i$ at site
$i \in L$ takes either the value 1 or the value -1. A specific choice of values for
the spin variables is called a configuration. The constants $J_{ij}$ are the
interaction coefficients. In our experiments we take $h_i = 0, \forall i \in L$. The ground
state is the configuration with minimum energy.

$$H = -\sum_{i<j \in L} J_{ij} \sigma_i \sigma_j - \sum_{i \in L} h_i \sigma_i \qquad (7)$$
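
The energy of a configuration can be computed as in the following sketch, which assumes the standard sign convention of the Ising Hamiltonian. The dictionary representation of the couplings and fields and the function name are assumptions of this illustration.

```python
def ising_energy(spins, couplings, fields=None):
    """Energy of a spin configuration.

    spins     -- sequence of spin values, each +1 or -1
    couplings -- dict mapping a pair of sites (i, j), i < j, to J_ij
    fields    -- optional dict mapping a site i to h_i (all zero if omitted)
    """
    energy = -sum(j_ij * spins[a] * spins[b]
                  for (a, b), j_ij in couplings.items())
    if fields:
        energy -= sum(h_i * spins[i] for i, h_i in fields.items())
    return energy
```

Since the optimization algorithms above maximize, the negated energy could be used as the fitness when searching for the ground state.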
6.2 Numerical Results

In all the experiments we use truncation selection. In the following tables
n is the number of variables, N is the population size, succ is the number of
times the optimum was reached in 100 experiments, gen is the average number of
generations to convergence, f is the average fitness of the best found solutions,
and eval is the average number of evaluations needed to find the optimum.
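
For reference, truncation selection keeps the best fraction of the population. The sketch below assumes maximization and a truncation coefficient given as a fraction of the population size; the function name is hypothetical.

```python
def truncation_selection(population, fitness, tau):
    """Return the best tau * N individuals of the population (maximization)."""
    k = max(1, int(tau * len(population)))
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[:k]
```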



Table 1. Comparison between the MN-FDA with other FDAs for unitation functions. 
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BigJump{30, 3, 1) 
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30 
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100 
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30 
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58 


32 
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81 


30 


LFDAq.5 


100 


38 


30 
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200 


96 


32 


LFDAo.25 


800 


92 


30 
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80 


30 
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200 


100 
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30 
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71 
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400 


100 


32 


LFDAo.75 


800 


12 


30 


MN-FDA 


30 


72 


30 


MN-FDA 


100 


92 


32 


MN-FDA 


600 


90 


30 


MN-FDA 


100 


98 


30 


MN-FDA 


200 


100 


32 


MN-FDA 


800 


100 



In Table 1, results of the MN-FDA for different functions are compared with
results published in [7] for the UMDA and the LFDA. For the functions considered
in our experiments, results of the LFDA were available in [7] only for the
values of the LFDA parameter $\alpha$ presented in the table^. On these functions the
MN-FDA achieved equal or better results than the LFDA. We have observed
that the learning algorithm used by the MN-FDA easily detects variables that
are independent. The BN learning algorithms used by Bayesian FDAs may have
problems recognizing independence, particularly if $\alpha$ is small.

In Table 2 we have included the results for the UMDA, the Tree-FDA, and the
LFDA for other functions. In all cases N = 1000, the truncation parameter
is 0.15 and the maximum number of generations is 25. For the LFDA, $\alpha$ = 0.5.

For function $f_{3deceptive}$ the results of the MN-FDA are the best. For function
$F_{IsoTorus}$ the LFDA finds the optimum more times than the MN-FDA; however, its
average fitness is lower. For function $F_{IsoP}$ the LFDA achieved the best results,
although the difference is not as significant as in the case of $f_{3deceptive}$. The
UMDA was not able to solve the problems with interactions.

We have generated 4 random instances of the Ising model for different numbers
of variables ($n \in \{25, 36, 49, 64\}$). For each of the instances we investigate two
different issues. First, the influence of using the prior information about the 
interactions of the variables. MN-FDA® is a Markov Network FDA that does not 
learn the independence graph from the data. In this case, the lattice where the 
Ising model is defined serves as the independence graph. The maximum size of the 

^ Parameter $\alpha$ has different meanings in the LFDA and the MN-FDA, although in the
LFDA it also serves to specify the density of the Bayesian network.
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Table 2. Comparison between the MN-FDA with other FDAs for different functions. 
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65 
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45 
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Table 3. Results of the MN-FDA for different Ising instances. 
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700 


67 
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28 
cliques is equal to 2. The second issue we study is the scaling of the algorithm
when the population size is fixed to 1000, and the coefficient of truncation
selection is T = 0.15.

The results of these experiments are shown in Table 3. An analysis of the
results shows the benefit of using prior information about the optimization
problem for increasing the efficiency of the MN-FDA. The small population size
that is enough for the convergence of the MN-FDA® is not sufficient for the MN-FDA.
As expected, when the number of variables increases, a larger population
size is needed to solve the problem.
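The truncation selection step used throughout these experiments (population size 1000, truncation coefficient 0.15) can be illustrated with a minimal sketch. The function name, the random bit-string population, and the use of a simple bit count as fitness are assumptions made for illustration only; they are not the functions studied in the paper.

```python
import random

def truncation_selection(population, fitness, tau=0.15):
    """Keep the best tau-fraction of the population (maximization)."""
    n_keep = max(1, int(round(tau * len(population))))
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[:n_keep]

# Hypothetical usage: N = 1000 random bit strings of length 25,
# with the number of ones as a stand-in fitness function.
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(25)] for _ in range(1000)]
selected = truncation_selection(population, fitness=sum, tau=0.15)
```

The selected set is the sample from which an FDA would then estimate its probabilistic model.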

7 Conclusions 

In this paper we have presented an FDA that approximates the probability distribution
determined by selection using a labeled JG. The JG is found by calculating
the maximal cliques of a Markov Network that can be given as an input or
learned from the data. Our work is related to previous work by Mühlenbein
et al. [9], where approximate factorizations were recognized as an alternative for
modeling probabilistic distributions. Our research, which has led to a different
way of finding these approximations, is also related to the work presented by
Brown et al. [2] on the application of MRFs to GAs. They have used probabilistic
models of GA fitness functions to generate new solutions. Our work shows
a number of relevant differences from this approach. First, we use statistical tests
to learn the structure of the interactions, whereas in [2] the structure of the interactions is
known a priori. Second, we construct the JG from the MN and use PLS
on the JG, whereas in [2] the Metropolis algorithm is used to generate new solutions.
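Finding the maximal cliques of the independence graph is the step that yields the JG. A minimal sketch of the basic Bron-Kerbosch enumeration (the algorithm of reference [1], here without pivoting) is given below. The adjacency-dictionary representation and the small 2x2 lattice are assumptions chosen for illustration, not the data structures of the MN-FDA implementation.

```python
def maximal_cliques(adjacency):
    """Enumerate the maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)

# Hypothetical independence graph: a 2x2 lattice over variables 0..3.
lattice = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
cliques = maximal_cliques(lattice)
```

On a lattice of this kind every maximal clique is an edge, which agrees with the maximum clique size of 2 reported for the Ising experiments.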

The results of our experiments show that the MN-FDA is able to optimize
theoretical functions as well as functions derived from practical problems, outperforming
other evolutionary algorithms. The MN-FDA generalizes other FDAs
by learning factorizations that need not be valid. More theoretical investigation
is needed to determine bounds for the convergence of the MN-FDA. Other
practical optimization problems must be tried to assess the performance of the
algorithm.



References 

1. C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected
graph. Communications of the ACM, 16(6):575-577, 1973.

2. D. F. Brown, A. Garmendia-Doval, and J. A. W. McCall. Markov random field
modelling of royal road genetic algorithms. In P. Collet, editor, Proceedings of EA
2001, volume 2310 of Lecture Notes in Computer Science, pages 65-76. Springer
Verlag, 2002.

3. M. Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic
sampling. Uncertainty in Artificial Intelligence, 2:317-324, 1988.

4. J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor, MI, 1975.

5. P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms. A New Tool
for Evolutionary Computation. Kluwer Academic Publishers, Boston/Dordrecht/
London, 2001.

6. H. Mühlenbein. The equation for response to selection and its use for prediction.
Evolutionary Computation, 5(3):303-346, 1997.

7. H. Mühlenbein and T. Mahnig. Theoretical Aspects of Evolutionary Computing,
chapter Evolutionary Algorithms: From Recombination to Search Distributions,
pages 137-176. Springer Verlag, Berlin, 2000.

8. H. Mühlenbein and T. Mahnig. Evolutionary synthesis of Bayesian networks for
optimization. Advances in Evolutionary Synthesis of Neural Systems, MIT Press,
pages 429-455, 2001.

9. H. Mühlenbein, T. Mahnig, and A. Ochoa. Schemata, distributions and graphical
models in evolutionary optimization. Journal of Heuristics, 5(2):213-247, 1999.

10. H. Mühlenbein and G. Paaß. From recombination of genes to the estimation of
distributions I. Binary parameters. In A. Eiben, T. Bäck, M. Schoenauer, and
H. Schwefel, editors, Parallel Problem Solving from Nature - PPSN IV, pages 178-187,
Berlin, 1996. Springer Verlag.

11. R. Santana, A. Ochoa, and M. R. Soto. The Mixture of Trees Factorized Distribution
Algorithm. In Proceedings of the Genetic and Evolutionary Computation
Conference GECCO-2001, pages 543-550, San Francisco, CA, 2001. Morgan Kaufmann
Publishers.

12. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Lecture
Notes in Statistics. Springer-Verlag, New York, 1993.



On Boosting Improvement: 

Error Reduction and Convergence Speed-Up 



Marc Sebban and Henri-Maxime Suchier 

EURISE - Université Jean Monnet de Saint-Etienne
23, rue du Dr Paul Michelon, 42023 Saint-Etienne cedex 2, France
{Marc.Sebban, Henri.Maxime.Suchier}@univ-st-etienne.fr



Abstract. Boosting is not only the most efficient ensemble learning
method in practice, but also the one based on the most robust theoretical
properties. The adaptive update of the sample distribution, which tends
to increase the weight of the misclassified examples, makes it possible to improve
the performance of any learning algorithm. However, its ability to avoid
overfitting has been challenged when boosting is applied to noisy data.
This situation is frequent with modern databases, built thanks to
new data acquisition technologies, such as the Web. The convergence
speed of boosting is also penalized on such databases, where there is
a large overlap of the probability density functions of the classes to learn
(large Bayesian error). In this article, we propose a slight modification
of the weight update rule of the algorithm Adaboost. We show that
by exploiting an adaptive measure of local entropy, computed from a
neighborhood graph built on the examples, it is possible to identify not
only the outliers but also the examples located in the Bayesian error
region. Taking this information into account, we correct the weights of the
examples to improve boosting performance. A broad experimental
study shows the interest of our new algorithm, called iAdaboost.



1 Introduction 

A large number of studies in machine learning have focused during the last
decade on classifier aggregation methods, which aim at improving, through voting
techniques, the performance of a single classifier. Among these methods, the
most commonly used are probably bagging [1], arcing [2], and boosting [3,4]. In this
article, we focus only on the third approach, which has attracted spectacular
interest over the last few years. The two main reasons for this growing popularity
are probably, on the one hand, its simplicity of implementation and, on the
other hand, the large number of recently established theorems on bounds, margins,
and boosting convergence [5,6]. Boosting is known to improve the performance
of any learning algorithm, assumed a priori unstable, called a weak learner.
The strategy consists in successively training the algorithm (T times) on various
probability distributions $D_t(x)$ over the learning sample LS, and in combining
the resulting classifiers (called weak hypotheses) into an efficient single classifier H.
The central point of boosting, and of its algorithm Adaboost [7,8] (see Algorithm
1), is the weight update rule. At each round, the current distribution favors the
weights of the examples mislabeled ($y(x) \neq h_t(x)$) by the previous hypothesis, which
characterizes well the adaptivity of Adaboost.

The first experimental results have shown that boosting seems to be immune
to overfitting. Indeed, not only does the empirical error on the learning sample
decrease exponentially with the number of iterations, but the generalization
error also keeps dropping, even after the empirical error has reached its minimum (even 0). Over
the last few years, these results have prompted researchers to look for theoretical justifications
of this behavior. This work made it possible to establish the link between
boosting and margin maximization^ [5], and in particular (i) to show that it is
possible to bound the theoretical error by a term that decreases as the margin
increases, and (ii) to prove that the margins on the training examples increase with
the boosting iterations. In parallel to these fundamental results, some studies on
boosting are concerned with the improvement of the algorithm Adaboost,
which is today confronted with two main problems.



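Data: A learning sample LS, a number of iterations T, a weak learner WL
Result: An aggregated classifier H

Initialize distribution: $\forall x \in LS,\ D_1(x) = \frac{1}{|LS|}$;

for t = 1 to T do
    $h_t = WL(LS, D_t)$;
    $\epsilon_t = \sum_{x : y(x) \neq h_t(x)} D_t(x)$;  $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$;
    Distribution update: $\forall x \in LS,\ D_{t+1}(x) = \frac{D_t(x)\, e^{-\alpha_t y(x) h_t(x)}}{Z_t}$;
    /* $Z_t$ is a normalization factor */

Return H s.t. $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$;

Algorithm 1: Pseudo-code of Adaboost.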
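The pseudo-code of Adaboost can be turned into a short runnable sketch. The version below assumes labels in {-1, +1}, uses threshold stumps as the weak learner, and clamps the weighted error away from 0 and 1 to avoid a division by zero; the function names and the toy sample are illustrative assumptions, not the authors' implementation.

```python
import math

def train_stump(X, y, D):
    """Return the threshold stump (feature, threshold, sign) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        cuts = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for thr in cuts:
            for sign in (+1, -1):
                err = sum(d for x, yi, d in zip(X, y, D)
                          if (sign if x[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, x):
    j, thr, sign = stump
    return sign if x[j] > thr else -sign

def adaboost(X, y, T):
    """Run T rounds; return the weighted stumps and the normalization factors Z_t."""
    m = len(X)
    D = [1.0 / m] * m                          # D_1(x) = 1 / |LS|
    ensemble, Z = [], []
    for _ in range(T):
        h = train_stump(X, y, D)
        preds = [stump_predict(h, x) for x in X]
        eps = sum(d for d, p, yi in zip(D, preds, y) if p != yi)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0) (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        w = [d * math.exp(-alpha * yi * p) for d, p, yi in zip(D, preds, y)]
        Zt = sum(w)                            # normalization factor Z_t
        D = [wi / Zt for wi in w]
        ensemble.append((alpha, h))
        Z.append(Zt)
    return ensemble, Z

def predict(ensemble, x):
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical one-dimensional sample that no single stump can separate.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble, Z = adaboost(X, y, T=5)
train_error = sum(predict(ensemble, x) != yi for x, yi in zip(X, y)) / len(X)
```

Returning the factors $Z_t$ makes it easy to compare the training error with their product, which is the quantity that bounds it.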



Firstly, the emergence of very large but often strongly noisy modern
databases forces researchers to study and improve the noise tolerance
of boosting. Indeed, while the success of Adaboost is indisputable,
there is increasing evidence that the algorithm is quite susceptible to noise [9],
resulting in overfitting. Recent work has therefore tried to limit this risk of overfitting.
Since Adaboost tends (wrongly) to increase the weight of outliers
exponentially, some algorithms aim at controlling the update rule [10,11,12].

Secondly, the other drawback of real-world databases does not directly
relate to boosting performance in terms of error, but rather to the convergence
speed of Adaboost. We will see in this article that in the presence of a high
overlap between the probability densities of the classes, the optimal error of the
learning algorithm is reached only after many iterations (T very large). In other words,
Adaboost “loses” time, and thus iterations, by reweighting examples which
theoretically do not deserve any attention, since they belong to the Bayesian
error region. Such instances are frequent in modern databases, which often
present a non-null Bayesian error. The rare studies aiming at increasing the
convergence speed [13] are more theoretical than usable in practice.

^ The margin expresses the degree of confidence in the class prediction for an example.

In this article, we deal with the two problems mentioned above. We
propose a modification of the weight update rule which (i) keeps the exponential
function so as not to challenge the error bounds, (ii) avoids applying it to noisy
data, and (iii) sets to zero the weights of examples in the Bayesian error region. In order
to detect, at a given step t, not only the noisy data but also those at the border
of the classes, we use the information contained in a neighborhood graph built on
the learning set. The graph is constructed only once, and only the node weights
vary during the boosting procedure. From the neighborhood of each
example we compute an entropy that evaluates the level of local information. This
non-monotone function thus makes it possible to assess, with the current weights,
whether or not each example deserves to be reweighted. Our procedure consists in
assigning to each example a coefficient which estimates, in a way, the confidence
that one can have in the new weight $D_{t+1}(x)$ calculated by Adaboost.

This article is organized as follows. In Section 2, we present a brief state
of the art of the main approaches, past or in progress, dealing with boosting
improvement. Section 3 is devoted to the presentation of our new weight update
rule and of our algorithm iAdaboost. In Section 4, we carry out a broad
experimental study comparing, on many databases, the performance
of Adaboost and iAdaboost in terms of error and convergence speed.

2 State of the Art 

In this section, we survey the main methods aiming at improving
boosting. We divide them into two categories according to their goal: managing
noisy data and dealing with the speed of convergence.

2.1 Noise Tolerance 

An experimental study was presented in [9], aiming in particular at comparing
the performance of boosting and bagging on a large number of benchmarks. The
main finding highlighted in this study is the weak tolerance of both methods
to noisy data. More surprisingly, Adaboost reaches a higher error rate than
bagging. Dietterich gives a convincing explanation of this behavior.
He shows that boosting tends to assign the examples to which noise was added
a much higher weight than other instances. As a result, hypotheses generated in
later iterations cause the combined hypothesis to overfit the noise. In order to
illustrate this phenomenon, Adaboost was run on an a priori linearly separable
sample, artificially corrupted by a given percentage of noise (see Fig. 1). We used
stumps (decision trees with only a root node) to learn the data. Stumps have all
the characteristics of good weak hypotheses [10] because they have, according to
the bias-variance terminology, a low variance and a high bias. Fig. 1 also shows
the results in terms of success rate estimated by cross-validation, over the first
500 iterations. It is clearly seen that Adaboost quickly diverges, confirming the
overfitting phenomenon on noisy data.

Fig. 1. An example of overfitting on noisy data. Adaboost finds the right separator as
of the first iteration. It then constrains itself to learn outliers, leading to overfitting.

To improve the noise tolerance, two different strategies can be considered. 
Since Adaboost is noise sensitive, a first approach would consist in removing 
noisy data before the boosting procedure. This kind of preprocessing is a matter 
for prototype selection (PS) which tries to a priori delete irrelevant or noisy data 
[14,15]. PS aims not only at reducing storage complexity of costly algorithms but 
also at improving their performances. To reach these goals, three main types 
of examples have to be suppressed with efficient heuristics (see Fig. 2). The 
first category corresponds to outliers (example 1 on the figure). The second 
concerns examples located in the Bayesian error region (example 2), which do 
not bring discriminant information to the classifier. The last category represents 
the instances at the center of a cluster (example 3). Since those examples are 
correctly labeled by instances located on the convex hull of the cluster, they can 
be considered as useless, and thus suppressed to reduce storage constraints. 

Fig. 2. Categories of irrelevant examples.

The second approach for treating noisy data aims at taking them into account
during the induction process, i.e. during the construction of the weak
hypotheses. Generally, the strategies suggested to avoid overfitting propose a modification
of the weight update rule. In MadaBoost [12], the modification is very
simple. Because the uncontrolled growth of the weights seems to be the root of
the problems of Adaboost, MadaBoost bounds the weight assigned to each
example by its initial probability. In this way, the weights of the examples cannot
become arbitrarily large, as happens in Adaboost. While the drawback
of this approach is that the boosting speed is much slower than that of Adaboost,
the authors prove that their moderate weight scheme has the boosting
properties. In [11], Freund presents an adaptive version of his “boost by majority”
algorithm, suggested in [3], in a new algorithm called BrownBoost. In
this approach, the algorithm is optimized to minimize the training error within
a pre-assigned number k of boosting iterations. As the algorithm approaches
its predetermined end, it becomes less and less likely that examples which have
large negative margins will become correctly labeled. Following this remark,
Freund uses the following weighting function, which corresponds in fact to the
probability density of a binomial variable and depends on k, on the index i of
the current iteration, on the number of correct classifications made so far, and
on the success probability $1 - \gamma$ imposed on each weak hypothesis.



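$$\alpha_r^i = \binom{k-i-1}{\lfloor k/2 \rfloor - r} \left(\tfrac{1}{2} + \gamma\right)^{\lfloor k/2 \rfloor - r} \left(\tfrac{1}{2} - \gamma\right)^{\lceil k/2 \rceil - i - 1 + r}$$

where r denotes the number of correct classifications made so far.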



This is a non-monotone function, which looks quite like the exponential function
for small negative margins. But beyond a certain value, the weight drops,
particularly at the end of the k iterations. The advantage of this approach is that
outliers will probably be detected, and their weight will thus stop increasing.
However, they may be identified too late, leading to a negative influence on
the resulting hypothesis. In Gentle Adaboost [10], the authors also use a
weighting scheme based on a function that grows more slowly than the exponential
one. Boosting is viewed as an approximation to additive modeling on the logistic
scale. By fitting an additive model of different and potentially simple functions,
it expands the class of functions that can be approximated. Finally, in [16],
the authors present boosting as an optimization problem in which slack variables
are introduced, which makes it possible to relax the constraint on highly weighted
examples, which are the ones having the worst negative margins.
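The idea of bounding each weight by its initial probability, as described above for MadaBoost, can be contrasted with the unbounded exponential weight in a minimal sketch. It assumes that the margin of an example is $y(x) \sum_t \alpha_t h_t(x)$ and that weights are left unnormalized; the exact rule of [12] may differ in its details, and both function names are hypothetical.

```python
import math

def adaboost_weight(d0, margin):
    """Unnormalized Adaboost weight: grows without bound as the margin becomes more negative."""
    return d0 * math.exp(-margin)

def capped_weight(d0, margin):
    """Same weight, but never allowed to exceed the initial probability d0."""
    return d0 * min(1.0, math.exp(-margin))

# Hypothetical outlier with initial probability 0.01 and a margin of -5.
unbounded = adaboost_weight(0.01, -5.0)
bounded = capped_weight(0.01, -5.0)
```

For correctly classified examples (positive margin) the two functions coincide; they differ only once the margin becomes negative.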



2.2 Convergence Speed 

While the negative impact of noisy data on boosting performance has been frequently
mentioned in the literature, the causes of a slowdown in convergence
have rarely been studied. However, we can easily assess the consequences of a
high level of density overlap. Indeed, the weights of examples located in the Bayesian
error region are alternately increased, because they are mislabeled, and decreased to
allow the correct classification of examples of the other class. This, of course,
has an impact on the final classifier H, which will generally need more iterations,
and thus more weak hypotheses, to reach its optimum. To illustrate our remarks,
Adaboost was run on an artificial sample containing a Bayesian error of 20%.
Fig. 3 shows the success rate over 500 iterations. It can be seen that the optimal
success rate of 80%, known a priori, is nearly reached only after 400 iterations.

In order to reduce the number of weak hypotheses, we proposed in [13] to
modify Schapire and Singer’s theorem [6], recalled below, which proves that
the minimization of the error is ensured by minimizing each $Z_t$.

Fig. 3. An example of delayed convergence due to the presence of an overlap.

Theorem 1. The following bound holds on the training error of H:

$$\frac{1}{m} \left|\{i : H(x_i) \neq y_i\}\right| \leq \prod_{t=1}^{T} Z_t$$
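The reasoning behind this bound can be written out in a few lines. The derivation below is the standard argument, stated under the assumptions that labels and weak hypotheses take values in {-1, +1}, that $D_1(x) = 1/m$, and that $H(x)$ is the sign of $f(x) = \sum_t \alpha_t h_t(x)$.

```latex
% Unravelling D_{t+1}(x) = D_t(x) e^{-\alpha_t y(x) h_t(x)} / Z_t from D_1(x) = 1/m:
D_{T+1}(x) = \frac{\exp\left(-y(x) f(x)\right)}{m \prod_{t=1}^{T} Z_t},
\qquad f(x) = \sum_{t=1}^{T} \alpha_t h_t(x).

% If H(x) = \mathrm{sign}(f(x)) \neq y(x), then y(x) f(x) \le 0, hence
[H(x) \neq y(x)] \le \exp\left(-y(x) f(x)\right).

% Summing over LS and using \sum_{x \in LS} D_{T+1}(x) = 1:
\frac{1}{m} \left|\{x \in LS : H(x) \neq y(x)\}\right|
  \le \frac{1}{m} \sum_{x \in LS} \exp\left(-y(x) f(x)\right)
  = \prod_{t=1}^{T} Z_t .
```

This is why minimizing each $Z_t$ in turn drives the training error down.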

We modified this theorem in [13] by integrating the Bayes risk, which makes it possible to handle
situations where many examples share the same description. Consider the
learning sample $LS = \{(x_1, y(x_1)), \ldots, (x_m, y(x_m))\}$ containing m examples with
representation x and class y(x). We define $n^+(x)$ and $n^-(x)$ respectively as the
number of positive and negative examples sharing this representation. Let

$$\delta(x) = \frac{|n^+(x) - n^-(x)|}{n^+(x) + n^-(x)}$$
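As an illustration of this quantity, the following sketch computes it for every distinct description in a small sample. The function name and the toy data (hashable tuples as descriptions, labels in {-1, +1}) are assumptions made for the example.

```python
from collections import Counter

def delta(sample):
    """Map each description x to |n+(x) - n-(x)| / (n+(x) + n-(x))."""
    pos, neg = Counter(), Counter()
    for x, y in sample:
        (pos if y > 0 else neg)[x] += 1
    return {x: abs(pos[x] - neg[x]) / (pos[x] + neg[x]) for x in set(pos) | set(neg)}

# Hypothetical sample: the description ("a", 1) appears twice as positive and once as negative.
sample = [(("a", 1), +1), (("a", 1), +1), (("a", 1), -1), (("b", 0), -1)]
d = delta(sample)
```

A description shared equally by both classes gets the value 0, while a description seen in only one class gets the value 1.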



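Theorem 2. The following bound holds on the training error of H:

$$\frac{1}{m} \left|\{i : H(x_i) \neq y_i\}\right| \leq \left(\prod_{t=1}^{T} Z_t\right) E_{D_{T+1}}[\delta(x)] + \epsilon^*$$

where $\epsilon^*$ is the minimal error on LS and $E_{D_{T+1}}[\delta(x)]$ is the expectation of $\delta(x)$
on distribution $D_{T+1}$, such that

$$D_{T+1}(x') = \frac{\exp\left(-y(x') \sum_{t=1}^{T} \alpha_t h_t(x')\right)}{m \prod_{t=1}^{T} Z_t}$$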

The interest of this theorem is that it integrates the Bayes risk into the upper bound. It
suggests cancelling the effect of the examples located in the Bayesian error
region, the reweighting process favoring those for which $\delta(x)$ is highly in favor
of one class. We modified the distribution $D_t(x)$ with the following update rule:

$$\forall x',\quad D'_t(x') = \frac{\sum_{x \in LS^*} D_t(x')\, [\pi(x, x')]\, \delta(x)}{\sum_{x'' \in LS} \sum_{x \in LS^*} D_t(x'')\, [\pi(x, x'')]\, \delta(x)}$$



where LS* represents the set of examples containing only one instance of each
given representation, and where $[\pi(x, x')] = 1$ if the predicate “x and x' are identical
descriptions” is true and $[\pi(x, x')] = 0$ otherwise. Note that if LS* = LS, we
recover Schapire and Singer’s results. The first expected effect of this theorem is
a faster convergence to the optimal risk. From artificial experiments
in feature selection [13], we have observed that the construction of up to 30% of the
weak hypotheses can be saved. Nevertheless, from a practical point of view,
situations where the representations of two examples from different classes are
strictly the same are very rare, especially with high-dimensional representations.
In the next section, we will try, through the use of the information contained
in a neighborhood graph, to estimate whether an example is potentially located in the
Bayesian error region, even if it does not share its representation with other examples.

Note that other approaches, which do not belong to the two categories stated
above, have also tried to improve boosting performance. For example, the algorithm
iBoost [17] aims at specializing weak hypotheses on the examples they are supposed
to label correctly. Thanks to boolean variables, each example is weighted in
H(x) by a coefficient representing its adequacy with each $h_t$. The experimental
results show a fairly good improvement over Adaboost. In [18], a new boosting
algorithm is proposed, in which a test example gets the class of the weighted
majority of the examples having received the same proportion of votes among the T
weak hypotheses. Noise tolerance as well as margin growth are also theoretically
studied there. Finally, RegionBoost [19] proposes a new hypothesis weighting
scheme. Weighting is evaluated at the moment of the vote using
the k nearest neighbors of the example to label. This approach makes it possible to
specialize each classifier on specific areas of the representation space. In spite of
a performance improvement, the results show a certain noise sensitivity.



3 The Algorithm iAdaboost

The different strategies listed above for dealing with noisy data are not totally
satisfying. Firstly, definitively removing noisy data during a preprocessing stage
does not provide a moderate decision rule. Beyond the gain in terms of storage
requirements, wrongly removed examples could have dramatic effects on
classifier performance. Secondly, as we said before, the step-by-step discovery
of noisy data as done in BrownBoost is not completely satisfactory either. Indeed,
an outlier could have time to damage the final classifier. Moreover, the use of
functions different from the exponential one (as in MadaBoost) is likely to call
into question some theoretical results on bounds. Finally, the use of the efficient
soft margins [16] does not solve the problems linked to bi-modal distributions
(with two peaks of density), where relevant instances could be seen as outliers
when using weak learners such as stumps. To avoid forcing Adaboost to learn either
noisy data or examples that would become too hard to learn during the boosting
process, we are going to build a local information criterion around each example.
This criterion will allow us not only to estimate overfitting risks, but also to evaluate
whether an example is located in the Bayesian error region. We build a geometrical
graph, which will allow us to measure the information around each example.
We consider here a very simple graph, the k-nearest-neighbor graph (kNN) [20],
built on LS.

Definition 1. Let N(x) be the neighborhood of x such that N(x) = {x' ∈ LS | x'
is one of the kNN of x using the Euclidean distance}.

Note that it is possible to use other graphs, such as the Gabriel graph, the Relative
Neighborhood graph, the minimal spanning tree, or the Delaunay triangulation [21]. The
properties of the kNN graph, and in particular the fact of having an error rate
bounded by twice the Bayesian error [22], led us to choose this graph.
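A brute-force construction of such a neighborhood graph can be sketched as follows. The quadratic-time search, the exclusion of each point from its own neighborhood, and the tie-breaking by index are assumptions of this illustration; the paper does not specify these details.

```python
import math

def knn_graph(points, k):
    """Return, for each index i, the indices of its k nearest neighbors (Euclidean), excluding i."""
    graph = []
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        graph.append([j for _, j in sorted(others)[:k]])
    return graph

# Hypothetical two-dimensional sample with two groups of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
graph = knn_graph(points, k=2)
```

Because the graph depends only on the descriptions and not on the weights, it needs to be computed a single time before boosting starts.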

Definition 2. Let $n_t^{=}(x)$ (resp. $n_t^{\neq}(x)$) be the sum of the weights $D_t(x')$ of the
examples x' having the same label as x (resp. the opposite label of x) in N(x):

$$n_t^{=}(x) = \sum_{x' \in N(x) : y(x') = y(x)} D_t(x'), \qquad n_t^{\neq}(x) = \sum_{x' \in N(x) : y(x') \neq y(x)} D_t(x')$$

Definition 3. Let $I_t(x)$ be the level of confidence in the new weight $D_{t+1}(x)$ at
iteration t. $I_t(x)$ is a mapping from [-1, +1] to [0, 0.25], defined as follows:

$$I_t(x) = |\gamma_t(x)|\,(1 - |\gamma_t(x)|) \quad \text{where} \quad \gamma_t(x) = \frac{n_t^{=}(x) - n_t^{\neq}(x)}{n_t^{=}(x) + n_t^{\neq}(x)}$$

In order to have a coefficient ranging between 0 and 1, we will rather use $4 I_t(x)$
for measuring the confidence in $D_{t+1}(x)$. In order to adjust the weights, we use
the following reweighting scheme, where $Z'_t$ is the new normalization coefficient:

$$D'_{t+1}(x) = \frac{4 I_t(x)\, D_{t+1}(x)}{Z'_t}$$
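The confidence coefficient and the resulting reweighting can be sketched in a few lines. The sketch assumes that the neighborhood graph is given as lists of indices, that the neighborhood sums use the current weights $D_t$, and that an example whose neighbors all have zero weight receives zero confidence; that last convention, the fallback used when every weight vanishes, and the function names are assumptions not stated in the text.

```python
def confidence(i, labels, D, graph):
    """Return 4 * I_t(x_i), computed from the weighted labels in the neighborhood of x_i."""
    same = sum(D[j] for j in graph[i] if labels[j] == labels[i])
    diff = sum(D[j] for j in graph[i] if labels[j] != labels[i])
    if same + diff == 0.0:       # all neighbors have zero weight (assumed convention)
        return 0.0
    gamma = (same - diff) / (same + diff)
    return 4.0 * abs(gamma) * (1.0 - abs(gamma))

def reweight(D_next, labels, D, graph):
    """Scale each Adaboost weight by its confidence and renormalize (the factor Z'_t)."""
    scaled = [confidence(i, labels, D, graph) * D_next[i] for i in range(len(D_next))]
    z = sum(scaled)
    # If every example is annihilated, fall back to the Adaboost weights (assumption).
    return [s / z for s in scaled] if z > 0 else list(D_next)

# Hypothetical sample of four examples with uniform weights; example 3 is an outlier.
labels = [+1, +1, +1, -1]
D = [0.25, 0.25, 0.25, 0.25]
graph = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
c_regular = confidence(0, labels, D, graph)
c_outlier = confidence(3, labels, D, graph)
D_prime = reweight([0.1, 0.1, 0.1, 0.7], labels, D, graph)
```

In this toy case the outlier, whose neighbors all carry the opposite label, gets a confidence of 0, so the large weight that Adaboost would give it is cancelled.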

The function $I_t(x)$ is graphically described in Fig. 4. Note that this function
was built by merging two ideas coming from boosting and prototype selection.
The first makes it possible to solve the problems due to the systematic
exponential growth of the weight of mislabeled examples. We aim here at using
a non-monotone function, while keeping the exponential update for the
examples which deserve it. Note that for negative values of $\gamma_t(x)$, $I_t(x)$ looks
like the binomial distribution used in [3], except that it does not depend on the
iteration index, and can detect an outlier right at the beginning of the process.
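To make the non-monotone shape concrete, a few values of the normalized coefficient $4 I_t(x)$ can be worked out directly from its definition $I_t(x) = |\gamma_t(x)|(1 - |\gamma_t(x)|)$; the particular values of $\gamma_t(x)$ chosen here are only illustrative.

```latex
4 I_t(x) = 4\,|\gamma_t(x)|\,\bigl(1 - |\gamma_t(x)|\bigr)

\begin{array}{ll}
|\gamma_t(x)| = 0:           & 4 I_t(x) = 0 \\
|\gamma_t(x)| = \tfrac{1}{4}: & 4 I_t(x) = 4 \cdot \tfrac{1}{4} \cdot \tfrac{3}{4} = \tfrac{3}{4} \\
|\gamma_t(x)| = \tfrac{1}{2}: & 4 I_t(x) = 4 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 1 \\
|\gamma_t(x)| = \tfrac{3}{4}: & 4 I_t(x) = 4 \cdot \tfrac{3}{4} \cdot \tfrac{1}{4} = \tfrac{3}{4} \\
|\gamma_t(x)| = 1:           & 4 I_t(x) = 0
\end{array}
```

The coefficient therefore peaks when the weighted neighborhood is split three to one, and vanishes both when the neighborhood is evenly split and when it is unanimous.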

The second idea takes into account the different categories of irrelevant examples
listed in prototype selection. $I_t(x)$ deals with outliers, overlaps and clusters
alike. Indeed, irrelevant examples will be detected by a value of $|\gamma_t(x)|$ that is null (Bayesian
error region) or equal to 1 (outlier or center of a cluster). In all these cases, $I_t(x)$ will
be equal to 0, and the example will be annihilated. Between these two
minima, the confidence in $D_{t+1}(x)$ is a non-monotone function of $\gamma_t(x)$, which
expresses the information level around each example. We introduced all these
ideas in a new algorithm, called iAdaboost. Its pseudo-code (see Algorithm 2)
closely resembles that of Adaboost, since the main modification concerns
the information collected within the neighborhood graph and its use in the
computation of the confidence coefficient of $D_{t+1}(x)$.

Fig. 4. The function $I_t(x)$.



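Data: A learning sample LS, a number of iterations T, a weak learner WL
Result: An aggregated classifier H

Build the k-nearest-neighbor graph on LS;
Initialize distribution: $\forall x \in LS,\ D_1(x) = \frac{1}{|LS|}$;

for t = 1 to T do
    $h_t = WL(LS, D_t)$;
    $\epsilon_t = \sum_{x : y(x) \neq h_t(x)} D_t(x)$ and $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$;
    Distribution update: $\forall x \in LS,\ D_{t+1}(x) = \frac{D_t(x)\, e^{-\alpha_t y(x) h_t(x)}}{Z_t}$
    and $D'_{t+1}(x) = \frac{4 I_t(x)\, D_{t+1}(x)}{Z'_t}$,
    where $I_t(x) = |\gamma_t(x)|\,(1 - |\gamma_t(x)|)$ and $\gamma_t(x) = \frac{n_t^{=}(x) - n_t^{\neq}(x)}{n_t^{=}(x) + n_t^{\neq}(x)}$;

Return H s.t. $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$;

Algorithm 2: Pseudo-code of iAdaboost.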
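A compact runnable sketch of this procedure is given below. It assumes labels in {-1, +1}, threshold stumps as weak learner, a brute-force neighbor search, that the corrected distribution replaces $D_{t+1}$ in the next round, and that the loop stops if every example ends up with zero confidence. These choices, like all the function names, are assumptions of the illustration rather than details taken from the paper.

```python
import math

def knn(points, k):
    """Indices of the k nearest neighbors of each point (Euclidean, self excluded)."""
    return [[j for _, j in sorted((math.dist(p, q), j)
                                  for j, q in enumerate(points) if j != i)[:k]]
            for i, p in enumerate(points)]

def best_stump(X, y, D):
    """Exhaustive search for the threshold stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        vals = sorted({x[f] for x in X})
        for thr in [vals[0] - 1.0] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            for s in (1, -1):
                err = sum(d for x, yi, d in zip(X, y, D)
                          if (s if x[f] > thr else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, s)
    _, f, thr, s = best
    return lambda x: s if x[f] > thr else -s

def confidence(i, y, D, graph):
    """4 * I_t(x_i) computed from the current weights of the neighbors of x_i."""
    same = sum(D[j] for j in graph[i] if y[j] == y[i])
    diff = sum(D[j] for j in graph[i] if y[j] != y[i])
    if same + diff == 0.0:
        return 0.0
    g = abs(same - diff) / (same + diff)
    return 4.0 * g * (1.0 - g)

def iadaboost(X, y, T, k):
    m = len(X)
    graph = knn(X, k)                          # built only once
    D = [1.0 / m] * m
    ensemble = []
    for _ in range(T):
        h = best_stump(X, y, D)
        preds = [h(x) for x in X]
        eps = sum(d for d, p, yi in zip(D, preds, y) if p != yi)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0) (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Z_t cancels against Z'_t, so only the final normalization is applied.
        boosted = [d * math.exp(-alpha * yi * p) for d, p, yi in zip(D, preds, y)]
        scaled = [confidence(i, y, D, graph) * b for i, b in enumerate(boosted)]
        total = sum(scaled)
        if total == 0.0:                       # every example annihilated (assumption: stop)
            break
        D = [s / total for s in scaled]
    return ensemble, D

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Hypothetical sample: two groups on a line, with one mislabeled point at x = 0.4.
X = [(0.0,), (0.2,), (0.4,), (0.6,), (0.8,), (2.0,), (2.2,), (2.4,), (2.6,), (2.8,)]
y = [-1, -1, 1, -1, -1, 1, 1, 1, 1, 1]
ensemble, D = iadaboost(X, y, T=1, k=3)
```

After the first round of this toy run, the mislabeled point has weight 0 instead of the increased weight that plain Adaboost would assign to it, and so do the points whose neighborhoods are unanimous.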



4 Experimental Results 

The goal of this section is to compare, in terms of error and convergence speed,
the algorithms Adaboost and iAdaboost. Again, we use stumps as weak
hypotheses and a 10-fold cross-validation procedure. The kNN graph is built
using different values of k. The results listed in this section are those obtained
with the optimal value. In order to highlight the ability of iAdaboost to deal
with both noisy data and class overlap, we carried out two types of
experiments. In the first part, we worked on 11 databases of the UCI repository
[23]^, in which we introduced 10% of noise. This way of proceeding ensures
the presence of a minimum number of outliers with which to test the efficiency of
our approach. The second series of experiments aims at assessing the ability
of iAdaboost to improve the speed of convergence. We built a learning sample
with two linearly separable classes, and we artificially generated a density overlap
of a% of each distribution (a varying from 10 to 50%).
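The noise injection used in the first series of experiments can be illustrated with a short sketch. It assumes that noise means flipping the class label of a randomly chosen 10% of the examples, which the text does not state explicitly; the function name and the seed are also hypothetical.

```python
import random

def flip_labels(y, rate, rng):
    """Flip the label of a randomly chosen fraction `rate` of the examples."""
    flipped = list(y)
    for i in rng.sample(range(len(y)), int(round(rate * len(y)))):
        flipped[i] = -flipped[i]
    return flipped

# Hypothetical usage on 100 positive labels with 10% of label noise.
rng = random.Random(0)
clean = [1] * 100
noisy = flip_labels(clean, 0.10, rng)
```

Sampling indices without replacement guarantees that exactly the requested number of labels is changed.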

^ They have been selected from the limited set of bi-class databases provided by the UCI.

Table 1. Performances in terms of success on 11 bases of the UCI repository.

BASE              Stump      Adaboost    iAdaboost
                  % succ     % succ      k      % succ
White House       85.89      85.21       5      85.89
Pima              66.05      67.61       14     69.17
Ionosphere        65.61      70.90       5      74.61
Glass             64.00      68.8        6      69.3
Vehicle           60.89      67.7        7      68.5
Tic-Tac-Toe       59.45      68.25       8      69.57
Xd6               55.80      69.71       6      69.53
Heart             73.38      72.28       4      73.39
Echocardiogram    55.05      65.43       7      66.92
Breast            81.79      84.64       5      87.35
Car               71.60      74.19       10     74.20
AVERAGE           67.23      72.25       7      73.49

Fig. 5. Scatter plot on 11 databases.



Table 1 shows the results obtained during the first study. We indicate the success
rate (% succ) for Adaboost and iAdaboost. In order to show the effect of
boosting, we also give the success rate (Stump) obtained with only one
weak hypothesis built from the initial distribution. Finally, the optimal value of
k is indicated for each base. Note that in order to ensure a fair comparison, the
same folds were used during the cross-validation. The results
show that the positive effects of iAdaboost are indisputable. Indeed, on 10
bases, our algorithm improves the performance of Adaboost. Over all 11
databases, the average gain with our method is about 1.2 points (72.25 vs 73.49), which is highly
statistically significant according to a Student test. Another concise way to display the
results is proposed in Fig. 5. Each dot represents a database. A dot above the
bisecting line indicates that iAdaboost is better than Adaboost.

Table 2 presents the results on the convergence speed of both algorithms.
Several remarks can be made. The first one concerns the iteration from
which the algorithm has reached its optimum. Except for a small overlap level
(10%), iAdaboost always reaches its optimum before Adaboost. The second
characterizes the difference between the two algorithms in terms of convergence.
Indeed, the higher the overlapping rate is, the higher the difference between
the two convergence iterations ($T_2 - T_1$) becomes. Adaboost is less and less
efficient at quickly learning the two classes. Finally, we tested, using a measure
of dispersion, whether iAdaboost is more stable than Adaboost. We calculated the
standard deviations ($\sigma_1$ and $\sigma_2$) of the success rates, from the stabilization
iteration of iAdaboost to the end (T = 1000). One can note that overall, our
algorithm offers a higher stability (standard deviation ranging from 0.1 to 0.3) than Adaboost,
which is more disturbed by the successive weight updates (from 0.26 to 1.51).
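The significance statement made above for Table 1 can be explored with a paired Student t statistic computed across the 11 databases. The sketch below uses the success rates listed in Table 1; treating the databases as paired observations is an assumption, since the text does not say how the test was set up, and the per-fold results are not available here.

```python
import math

# Success rates (%) from Table 1: (Adaboost, iAdaboost) for each database.
table1 = {
    "White House": (85.21, 85.89), "Pima": (67.61, 69.17),
    "Ionosphere": (70.90, 74.61), "Glass": (68.8, 69.3),
    "Vehicle": (67.7, 68.5), "Tic-Tac-Toe": (68.25, 69.57),
    "Xd6": (69.71, 69.53), "Heart": (72.28, 73.39),
    "Echocardiogram": (65.43, 66.92), "Breast": (84.64, 87.35),
    "Car": (74.19, 74.20),
}

def paired_t(pairs):
    """Return the mean difference and the paired t statistic (n - 1 degrees of freedom)."""
    diffs = [b - a for a, b in pairs]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean, mean / math.sqrt(var / n)

mean_gain, t_stat = paired_t(list(table1.values()))
wins = sum(b > a for a, b in table1.values())
```

The resulting statistic would then be compared with a Student distribution with 10 degrees of freedom.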



Table 2. Performances in terms of convergence speed.

Overlap    iAdaboost          Adaboost           T_2 - T_1
           T_1      σ_1       T_2      σ_2
10%        200      0.29      200      0.30      0
20%        150      0.11      250      0.26      100
30%        170      0.24      350      0.56      180
40%        300      0.24      800      0.82      500
50%        375      0.28      1000     1.51      625



5 Conclusion 

We proposed in this article a modification of the weight update of Adaboost in
order to better take into account not only noisy data, which penalize the success rate
of the final classifier, but also the examples located in the Bayesian error region,
which slow down boosting convergence. Beyond the excellent performance obtained by
our algorithm iAdaboost, we are currently working on theoretical justifications
of such an update and on its effects on error bounds. It will also be important to
verify theoretically that the use of this update rule does not call into question
the main principles of boosting, and that margins keep increasing with the
number of iterations.

Abstract. In general, support vector machines (SVMs), when applied to text
classification, provide excellent precision but poor recall. One means of customizing
SVMs to improve recall is to adjust the threshold associated with an
SVM. We describe an automatic process for adjusting the thresholds of generic
SVMs which incorporates a user utility model, an integral part of an information
management system. By using thresholds based on utility models and the ranking
properties of classifiers, it is possible to overcome the precision bias of
SVMs and ensure robust performance in recall across a wide variety of topics,
even when training data are sparse. Evaluations on TREC data show that our
proposed threshold adjusting algorithm boosts the performance of baseline
SVMs by at least 20% for standard information retrieval measures.



1 Introduction 

Generic support vector machines (SVMs) [19] provide excellent performance on a 
variety of learning problems including: handwritten character recognition [8], face 
detection [15] and most recently text categorization [6]. However, when generic 
SVMs are applied to text classification*, their performance, while being competitive 
with other approaches (e.g., Rocchio, naive Bayes) from a precision perspective, is 
not competitive from a recall perspective [6], [17]. 

Several attempts have been made to improve the recall of SVMs while not ad- 
versely affecting precision in a text classification context. The first category of such 
attempts falls under the label of uneven margin-based learning [12]. Here, a simple 
margin-based version of the perceptron learning algorithm is used to learn a model 
that has a pre-specified required positive and negative margin. The required positive 
and negative margins are heuristically determined using cross-validation on the train- 
ing corpus. The second category of proposed SVM improvements for text classifica- 
tion is cost-based and is also incorporated into the SVM learning algorithm. To 
counter the imbalance of positive training documents to negative training documents, 



* Text classification is a very active area of research and application in information manage- 
ment and is concerned with assigning a document to one or more pre-specified categories or 
classes. 

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 361-372, 2003.
a higher cost is associated with the misclassification of positive documents than with
negative documents [20]. Tuning the asymmetric misclassification cost can provide 
significant improvement, though this process can be prohibitively expensive. The 
final category of proposed SVM improvements for text classification is based on 
post-processing or thresholding the output value (or margin/score) of the learnt SVM. 
This is generally an inexpensive one-dimensional optimization problem which can 
lead to significant improvement in performance measures. 

This post-processing thresholding step is independent of the learning step. The 
critical step in thresholding is to determine the value, known as the threshold, at 
which a decision changes from labeling a document as positive to labeling a docu- 
ment as negative. Many of the approaches to thresholding that have been developed 
in other fields (such as information retrieval) can be applied directly in thresholding 
the score output of SVMs. Though thresholding has received a lot of attention in the
information retrieval sub-field of adaptive filtering, optimizing thresholds remains a 
challenging problem. The main challenge arises from a lack of labeled training data. 
Due to limited amounts of training data, standard approaches to information retrieval 
use the same data for both model fitting (learning) and threshold optimization. Con- 
sequently, this often biases the threshold to high precision, i.e., overfits the training 
data. 

The following provides a brief overview of information retrieval-based thresholding
approaches: Yang presents an empirical study of a variety of thresholding strategies
for text categorization using k-nearest neighbors [22]; Zhai et al. present a
beta-gamma thresholding algorithm for adaptive filtering, which has been adapted in the
thresholding strategy proposed in this paper [23]; Zhang and Callan propose a maximum
likelihood estimation of filtering thresholds [24]; Ault and Yang introduce a
margin-based local regression approach for predicting optimal thresholds for adaptive
filtering [2]; Arampatzis describes a score-distributional threshold optimization
approach [1].

Some of these IR approaches have been adapted already for thresholding SVMs. 
Cancedda et al. report one such approach to adjusting the threshold of SVMs based 
upon a Gaussian modeling process of the SVM scores (output value) for positive and 
negative documents for each category [3]. This Gaussian model is then used to gener- 
ate sample document scores and an optimal threshold is set to the score corresponding 
to maximum utility on the cumulative utility curve for the generated labeled scores. 
This approach, combined with asymmetric learning, has led to huge improvements in 
recall and precision, though it is hard to discern how much improvement can be
attributed to the asymmetric cost learning strategy and how much to the thresholding
strategy. The impact of adjusting the threshold will become clearer later in this
paper, when we show that it can significantly boost the performance of baseline SVMs
for text classification.

In this paper, we adapt a procedure for setting the threshold of the learnt SVM using
the beta-gamma thresholding technique, developed previously for adaptive text
filtering using information retrieval-based filters [22], a more challenging task than
text classification. In addition, we present a novel and very cheap technique for
selecting the parameters of the threshold adjustment strategy automatically, based upon
cross-fold validation. This paper is organized as follows: Section 2 describes the
proposed threshold adjustment algorithm after a brief overview of generic linear SVM
modeling; Section 3 describes the experimental setup, detailing the explored variables
and datasets used to evaluate the proposed approach; Section 4 presents the results of
evaluations of the proposed approach and compares these to other approaches; Section 5
presents some concluding remarks.

2 Proposed Thresholding Approach 

The proposed threshold adjustment algorithm is performed immediately after learning 
an SVM. In this section, we first present some background material on SVMs and 
then present the proposed threshold adjusting algorithm. 

2.1 Support Vector Machines 

Though support vector machines (SVM) were originally introduced by Vapnik in 
1979 [19], and have provided state-of-the-art performance for a variety of learning 
problems (and in some cases better than state-of-the-art), it is only recently that they 
have gained popularity in the text retrieval and classification community. Geometri- 
cally (for linear support vector machines), a learnt SVM model can be seen as a hy- 
perplane that separates a set of positive examples (belonging to the positive class) 
from a set of negative examples (negative class). This is illustrated in Figure 1, where
H is a hyperplane that separates positive class examples (denoted by “+”) and negative
class examples (denoted by “-”). Mathematically, a hyperplane can be represented
as follows:

    $\sum_{i=1}^{n} w_i x_i + b = 0$    (1)

This can be written more succinctly in vector format as $\langle W, X \rangle + b = 0$.
Here W is known as the weight vector and corresponds to the normal vector to the
separating hyperplane, H; X is an input vector or document; b is the bias term, which
fixes the offset of the hyperplane from the origin (the perpendicular distance is
$|b| / \|W\|$); and n represents the number of input variables. In the case of text, n
can be viewed as the number of words (or phrases, etc.) that are used to describe a
document. The classification rule for an unlabeled document, X, using a support vector
machine with separating hyperplane (W, b), is as follows:

    $Class(X) = Sign(\langle W, X \rangle + b)$    (2)

The distance from the hyperplane to the nearest positive or negative examples is
known as the margin of the SVM. Learning a linear SVM can be simply thought of as
searching for a hyperplane (i.e., the weights and bias values) that separates the data
with the largest margin. As a result, learning for linearly separable data can be viewed
as the following optimization problem:

    minimize $\|W\|$
    subject to: $y_i (\langle W, X_i \rangle + b) \geq 1 \quad \forall i = 1, \ldots, n$    (3)

where $X_i$ is a training example with label $y_i$, and $\|W\|$ is the norm of the
weight vector (i.e., $\sqrt{\langle W, W \rangle}$). In the case of non-linear
separability, two alternative formulations have been proposed: one is based upon slack
variables; the other is based upon using non-linear kernels (see [20] for more details).
The slack variable or soft formulation of SVM learning [4] allows, but penalizes,
examples that fall on the wrong side of the supporting hyperplanes (the two margin
hyperplanes on either side of H in Figure 1), i.e., false positives or false
negatives. Different or asymmetric costs can be associated with false negatives and
false positives. In practice, learning SVMs is more efficiently conducted in a dual
space [19]. For our current study, two variations of the dual space Sequential Minimal
Optimization (SMO) learning algorithm [16] were implemented and evaluated:
SMOK1 and SMOK2, corresponding to modification 1 and modification 2, respectively,
as proposed by Keerthi et al. [7]. Our current implementation caters only for
symmetric false positive and false negative costs.
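
The link between the margin and the norm of the weight vector can be spelled out in a few lines. The derivation below is a sketch in the notation of this section; it assumes the usual canonical scaling, in which the training examples nearest to H satisfy the constraint with equality.

```latex
% Distance from a training example X_i to the hyperplane <W, X> + b = 0:
\[ d(X_i) = \frac{|\langle W, X_i \rangle + b|}{\|W\|} \]
% Under y_i(<W, X_i> + b) >= 1, with equality for the nearest examples,
% the smallest such distance (the margin) is
\[ \min_i d(X_i) = \frac{1}{\|W\|} \]
% so maximizing the margin is equivalent to minimizing the norm:
\[ \max_{W,b} \frac{1}{\|W\|}
   \;\Longleftrightarrow\; \min_{W,b} \|W\|
   \;\Longleftrightarrow\; \min_{W,b} \tfrac{1}{2}\|W\|^{2} \]
```

The squared form in the last step is the one usually handed to a quadratic programming solver; it has the same minimizer as the norm itself.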




Fig. 1. A support vector machine in a two-dimensional input space, Word$_1$ x Word$_2$,
denoted by the hyperplane, H. Each document is associated with a category, “+” or “-”.
The support vectors correspond to the examples on the supporting hyperplanes on either
side of H.



2.2 Threshold Adjusting Algorithm for SVMs

Optimizing thresholds is a challenging problem because the limited amount of training
data available is generally required for training the base model, so it is rare to
have an independent sample solely for threshold optimization.
Standard approaches in text classification and retrieval use the same data for
both model fitting (learning) and threshold optimization [22]. Consequently, this 
often biases the threshold to high precision, i.e., the threshold overfits the training 
data. SVM learning algorithms focus on finding the hyperplane that maximizes the 
margin since this criterion provides a good upper bound of the generalization error. 
Learning based on this criterion leads to models with very good ranking ability (dem- 
onstrated empirically by the results in Section 4). However, the resulting separating 




Improving SVM Text Classification Performance through Threshold Adjustment 365 



hyperplane tends to be too conservative (high precision oriented). The natural thresh- 
old value for SVM learning and classification is zero (see Equation 2). Here, we pro- 
pose to combine the powerful ranking ability of SVMs with the beta-gamma thresh- 
olding algorithm [22] to reset the threshold of the learnt SVM in order to overcome 
this precision-oriented limitation. The powerful ranking ability of SVMs is exploited
only for threshold adjustment, and is not used in classification (as each document is
classified independently of the others). The beta-gamma thresholding algorithm
relaxes the SVM threshold from zero, i.e., translates the SVM hyperplane towards the 
denser class (i.e., the class with more training data). In addition to adapting the beta- 
gamma algorithm for adjusting the SVM threshold, we propose a novel means for 
setting the parameters of this algorithm - beta and gamma - using a cheap cross vali- 
dation mechanism. 

We first present the core beta-gamma thresholding strategy, and subsequently describe
how this can be used with cross validation to empirically determine the beta
and gamma parameters. The beta-gamma thresholding strategy consists of the following
steps and uses as input a category label, C; a labeled dataset, T, of documents
consisting of both positive and negative examples of C; a learnt SVM, M, that models
the category C; β, the threshold adjustment parameter; and UtilityMeasure, a utility
measure that models the user’s expectations:

SetSVMThresholdUsingBetaGamma(C, T, M, β, UtilityMeasure)

1. Rank the thresholding dataset, T, using the SVM, M, as scoring function, thereby yielding
a ranked document list R consisting of tuples <Document, SVMScore>.

2. Generate the cumulative utility curve for R, i.e., for each document in the ranked list R
compute the cumulative utility using the utility measure UtilityMeasure.

3. Determine the rank or indices of the maximum utility point on the cumulative curve and
the first zero utility point following the maximum utility point. Denote these respectively
as $i_{max}$ and $i_{zero}$. Assign the variables $\theta_{max}$ and $\theta_{zero}$ the
output scores of the SVM, M, for the documents associated with the maximum and zero
utility points respectively, i.e., the SVM scores of the documents at rank $i_{max}$ and
$i_{zero}$. (See Figure 2 for a graphic illustration of this step.)

4. Return the threshold, θ, which is calculated as follows:

    $\theta = \beta \theta_{zero} + (1 - \beta) \theta_{max}$    (4)

In the procedure outlined above, β is either provided heuristically or determined
using the beta-gamma cross-validation procedure outlined below. The following is a
more sophisticated version of this threshold adjustment algorithm, which uses α in
place of β in Equation 4 and takes into account the number of positive training
examples used in T:

    $\alpha = \beta + (1 - \beta) e^{-p\gamma}$    (5)

In this equation, p denotes the number of positive documents in the thresholding
dataset, T. The γ component of this threshold relaxation formulation provides a
mechanism to relax the threshold further than the relaxation based entirely upon β
(Equation 4). It has the biggest impact on the threshold when there are very few
positive documents. Once again, as is the case for β, γ is either provided
heuristically or determined using the beta-gamma cross-validation procedure outlined
below.
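
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the use of the linear utility 2R+ - N+ as the default, the treatment of "zero utility" as "utility at or below zero", and the fallback to the last rank when the curve never reaches zero are all assumptions made for the sketch.

```python
import math

def t10u(true_pos, false_pos):
    """Linear utility 2*R+ - N+ (assumed default utility measure)."""
    return 2 * true_pos - false_pos

def beta_gamma_threshold(scores, labels, beta, gamma=None, utility=t10u):
    """Return a threshold on SVM scores using the beta-gamma rule.

    scores: SVM output scores; labels: truthy for positive documents.
    gamma=None interpolates with beta directly; otherwise beta is first
    relaxed to alpha = beta + (1 - beta) * exp(-p * gamma).
    """
    # Step 1: rank documents by decreasing SVM score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    # Step 2: cumulative utility after accepting the top-k documents.
    curve, tp, fp = [], 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        curve.append(utility(tp, fp))
    # Step 3: maximum utility point and first zero point after it.
    i_max = max(range(len(curve)), key=curve.__getitem__)
    # Assumption: if utility never drops to zero, use the last rank.
    i_zero = next((i for i in range(i_max + 1, len(curve)) if curve[i] <= 0),
                  len(curve) - 1)
    theta_max, theta_zero = ranked[i_max][0], ranked[i_zero][0]
    # Step 4: interpolate between the two scores.
    alpha = beta
    if gamma is not None:
        p = sum(1 for y in labels if y)  # number of positive documents
        alpha = beta + (1 - beta) * math.exp(-p * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_max
```

With beta = 0 the function returns the score at the maximum utility point, and with beta = 1 the score at the zero utility point. A large gamma makes the exponential term vanish, so alpha reduces to beta.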

Fig. 2. Determining $\theta_{max}$ and $\theta_{zero}$ using a ranked list of training documents.

Now, we outline a procedure based upon n-fold cross validation to automatically
determine the values of β and γ in the threshold relaxation procedure. It consists of
the following steps and uses as input a category label, C; a labeled dataset, T, of
documents consisting of both positive and negative examples of C (for example, T could
be a subset or the complete training dataset); a learnt SVM, M, that models the
category C; UtilityMeasure, a utility measure that models the user’s expectations;
βs, the set of possible beta values (valid values for β are positive or negative real
numbers); γs, the set of possible gamma values; and n, the number of folds that will
be used in parameter selection.

SelectOptimalSVMThreshold(C, T, M, UtilityMeasure, βs, γs, n)

1. Partition the data into n non-overlapping subsets, ensuring that both positive
and negative documents are present in each fold or subset.

2. Foreach combination of β and γ values in βs and γs do steps 3 and 4

3. Foreach fold k (k = 1, ..., n)

• Set T' to the other n-1 folds

• Set θ = SetSVMThresholdUsingBetaGamma(C, T', M, β, γ, UtilityMeasure)

• Set $Utility_{\beta,\gamma,k}$ = the utility for M and the threshold, θ, over the fold k.
See Equation 6 for an explanation of how to use an adjusted threshold in conjunction
with an SVM.

4. Compute the average utility as follows:
$Utility_{\beta,\gamma} = \frac{1}{n} \sum_{k=1}^{n} Utility_{\beta,\gamma,k}$

5. End Foreach

6. Calculate the optimal threshold, $\theta_{opt}$, using the β and γ combination that has
the highest average utility $Utility_{\beta,\gamma}$ as follows:
$\theta_{opt}$ = SetSVMThresholdUsingBetaGamma(C, T, M, β, γ, UtilityMeasure)

7. Return $\theta_{opt}$.

The SVM classification rule is altered slightly as follows to accommodate the
adjusted threshold:

    $Class(X) = Sign(\langle W, X \rangle + b - \theta_{opt})$    (6)

For our experiments, we adapted the T10U linear utility measure (see Table 2) for
threshold optimization, as this provides an intuitive user utility model that generally
leads to improved recall and precision when used as a cost function in learning [17].
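
The parameter selection loop can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `set_threshold` is a hypothetical hook standing in for SetSVMThresholdUsingBetaGamma, folds are built by dealing positives and negatives round-robin (which assumes at least n documents of each class), and a document scoring exactly at the threshold is treated as accepted.

```python
from itertools import product

def select_threshold_params(scores, labels, betas, gammas, n_folds,
                            set_threshold, utility):
    """Pick the (beta, gamma) pair with the highest average held-out utility.

    set_threshold(scores, labels, beta, gamma) -> theta   (hypothetical hook)
    utility(true_pos, false_pos) -> float
    Returns (theta_opt, (beta, gamma)).
    """
    # Step 1: n non-overlapping folds, each holding both classes.
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    folds = [pos[k::n_folds] + neg[k::n_folds] for k in range(n_folds)]

    def held_out_utility(theta, fold):
        # Adjusted rule: accept a document when score - theta >= 0
        # (tie handling at exactly zero is an assumption).
        tp = sum(1 for i in fold if scores[i] >= theta and labels[i])
        fp = sum(1 for i in fold if scores[i] >= theta and not labels[i])
        return utility(tp, fp)

    best, best_avg = None, float("-inf")
    # Steps 2-5: average held-out utility for every parameter combination.
    for beta, gamma in product(betas, gammas):
        total = 0.0
        for k, fold in enumerate(folds):
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            theta = set_threshold([scores[i] for i in train],
                                  [labels[i] for i in train], beta, gamma)
            total += held_out_utility(theta, fold)
        average = total / n_folds
        if average > best_avg:
            best, best_avg = (beta, gamma), average
    # Steps 6-7: refit the threshold on all of T with the winning pair.
    return set_threshold(scores, labels, *best), best
```

Because the SVM scores are computed once and only the one-dimensional threshold is refit per fold, the loop costs a sort and a scan per parameter combination. This is one way to read the description of the procedure as cheap.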

3 Experimental Setup

This section describes the experimental variables, the experimental performance
measures and the datasets used for this study. The parameter settings explored in the
experiments reported in this paper are summarized in Table 1. Most are
self-explanatory, apart from how a document is represented. We represent a document
as a vector of terms that is derived as follows: we replace all numerical and
punctuation characters by spaces and eliminate stop-words such as articles and
prepositions; each term is associated with a TFxIDF weight, where TF denotes the
frequency of a term in a document, and IDF is calculated based on the distribution of
the term in the training corpus [18]. In all experiments the document vectors were
normalized to unit length.
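
The document representation just described can be illustrated with a short sketch. Several details are assumptions made only for the example: the stop-word list is a tiny placeholder, tokens are lowercased, only ASCII letters are kept, and IDF is taken as log(N / df), one common variant (the text does not state which IDF formula was used).

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}  # placeholder list

def tokenize(text):
    """Replace digits and punctuation by spaces, lowercase, drop stop-words."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def fit_idf(training_docs):
    """IDF per term from the training corpus; log(N / df) is an assumed variant."""
    n_docs = len(training_docs)
    doc_freq = Counter(term for doc in training_docs
                       for term in set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def vectorize(text, idf):
    """Sparse TF x IDF vector, normalized to unit Euclidean length."""
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # no informative terms; leave the vector as-is
    return {term: w / norm for term, w in weights.items()}
```

Terms unseen in the training corpus are dropped at vectorization time, since no IDF value is available for them; this too is a choice of the sketch.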

In our analysis, we examined several information retrieval performance measures 
which are presented in Table 2 along with their definitions. 



Table 1. Learning decision variables and explored values. 



Decision Variable                           Explored Values

Learning Algorithm                          SMOK2
C (upper bound for Lagrange multipliers)    0.4, 0.8, 0.9, 1, 2, 5
Tolerance                                   0.001
Type of kernel                              Linear
Sampling Ratio                              Used all training data
Number of terms k                           Use all terms
Term types                                  White space delimited tokens with numbers,
                                            punctuation, and stopwords removed
Term weighting                              TF_IDF



For our current study, we have performed an evaluation of learning threshold ad- 
justed SVM classifiers (TSVMs) on the following classification corpora: Reuters- 
21578 ModApte split collection [10] and TREC2001 corpus [17]. The main reasons 
for choosing these corpora include the following: these corpora are commonly used in 
benchmarking text classification problems; the Reuters-21578 corpus is a manageable 
size thereby enabling extensive experimentation (without being computationally 
prohibitive). The details of each corpus are presented below. 

3.1 Reuters-21578 (ModApte Split) 

The Reuters-21578 collection contains 12,902 newswire stories that had been classi- 
fied into 118 categories (e.g., corporate acquisitions, earnings, money market, grain, 
and interest) [10]. We followed the ModApte split in which 75% of the stories (9603 
stories) are used to build classifiers, while the remaining 25% (3299 stories) are used 
to test the accuracy of the resulting models in reproducing the manual category as- 
signments. Only 90 categories are modeled in our experiments. These 90 categories 
were selected based upon having at least one training and one testing example. 
Though only 90 categories were modeled, the examples belonging to the non-modeled
categories were used for training and testing.

3.2 TREC 2001 Corpus 

The TREC 2001 Corpus, officially known as “Reuters Corpus, Volume 1, English
Language, 1996-08-20 to 1997-08-19”, contains one year of Reuters newswire stories in
English, corresponding to 1.5 GB of data, or 810,000 news stories taken from the 
period August 1996 - August 1997. Each story has been assigned one or more cate- 
gory labels from 84 possibilities. The training dataset is limited to the last 12 days of 
August 1996 (corresponding to approximately 23,000 examples); the remaining 11 
months are designated as test data. More information about this corpus can be found 
at http://about.reuters.com/researchandstandards/corpus [17]. 

Table 2. Evaluation measures and their definitions, where $R^+$, $N^+$, $R^-$, and $N^-$
are true positives, false positives, false negatives and true negatives respectively.

Evaluation Measure    Definition

Precision             $p = R^+ / (R^+ + N^+)$

Recall                $r = R^+ / (R^+ + R^-)$

$F_\beta$             $F_\beta = ((\beta^2 + 1) \cdot p \cdot r) / ((\beta^2 \cdot p) + r)$

T10U / T11U           $T10U = T11U = 2R^+ - 1N^+$

T10SU                 $(\max(T10U, MinU) - MinU) / (MaxU - MinU)$,
                      where $MaxU = 2 \cdot (R^+ + R^-)$ and $MinU = -100$

T11SU                 $(\max(T11NU, MinNU) - MinNU) / (1 - MinNU)$,
                      where $T11NU = T11U / MaxU$ and $MinNU = -0.5$
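
The definitions in Table 2 translate directly into code. The sketch below is illustrative: the default beta = 0.5 and the convention of returning 0 when a denominator is zero are assumptions, not values stated in the table.

```python
def evaluation_measures(r_pos, n_pos, r_neg, beta=0.5,
                        min_u=-100.0, min_nu=-0.5):
    """Compute the Table 2 measures from contingency counts.

    r_pos: true positives (R+), n_pos: false positives (N+),
    r_neg: false negatives (R-).
    """
    precision = r_pos / (r_pos + n_pos) if r_pos + n_pos else 0.0
    recall = r_pos / (r_pos + r_neg) if r_pos + r_neg else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    t10u = 2 * r_pos - n_pos              # T10U = T11U
    max_u = 2 * (r_pos + r_neg)           # utility of a perfect run
    t10su = (max(t10u, min_u) - min_u) / (max_u - min_u)
    if max_u:
        t11su = (max(t10u / max_u, min_nu) - min_nu) / (1 - min_nu)
    else:
        t11su = 0.0                       # assumed convention: no relevant documents
    return {"precision": precision, "recall": recall, "f_beta": f_beta,
            "t10u": t10u, "t10su": t10su, "t11su": t11su}
```

Both scaled utilities clip poor runs at a floor (MinU or MinNU) before rescaling, so a run that retrieves many false positives is bounded below at 0 rather than pulling a macro-average arbitrarily far down.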



4 Results and Empirical Observations 

In the case of all examined corpora, a topic-specific binary classifier was learned
from the training data that models the topic (positive examples) and the not-topic
(negative examples). The values explored for β were restricted to the following list:
{-0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7,
0.75, 0.8}, while γ was set to 100 (effectively disabled). The γ parameter was disabled
after noticing no discernible improvement from using it in the context of
classification, though this parameter proved to be crucial in an adaptive text
filtering context [22], where a topic is defined differently and its definition is
adapted over time: usually a topic is defined in terms of a focused query and a small
number of explicitly labeled documents, and its definition is refined over time upon
receiving user feedback.

The proposed thresholding approach is compared against the following approaches:
baseline (unthresholded) SVMs; other threshold adjusting SVM approaches;
asymmetric (misclassification costs) SVMs; and traditional IR approaches.

Figure 3 compares the results for the Reuters-21578 corpus between the threshold
adjusted SVMs and baseline SVMs for each topic with respect to the T11SU evaluation
measure. For this graph of results, and for subsequent graphs of results, the
horizontal axis represents the topics (considered in a corpus), ranked in decreasing
order of the number of positive training documents available for that topic. This graph
has two primary vertical or y axes: the left vertical axis corresponds to the log
(base 10) of the number of training documents; the right vertical axis corresponds to
the difference in performance for the indicated measure (T11SU in the case of Figure 3)
between the threshold adjusted SVM (denoted as SVMThresh) and the baseline SVM. Positive
bars for this measure correspond to an improvement in performance when threshold
adjustment is used. Table 3 presents the macro-average results for precision, recall,
Fbeta, and T11SU for the Reuters-21578 corpus.




Table 3. A results comparison for the Reuters 87 ModApte corpus. 



Approach                T11SU    Fbeta    Precision    Recall

CC Thresholded SVMs     0.61     0.57     0.64         0.48
Linear SVM              0.54     0.48     0.58         0.33



Overall, we can see that adjusting the threshold using the beta-gamma procedure
boosts the performance of the baseline SVM on all examined evaluation measures at
a macro level for the Reuters-21578 corpus (Table 3). Examining each topic from a
T11SU perspective (Figure 3), we notice that the biggest improvement in T11SU
performance comes from topics that have fewer than fifty positive training documents,
topics that have traditionally been very difficult to model. Overall, 80% of the topics
have improved or have not been adversely affected by this procedure.

Figure 4 compares the results for the TREC 2001 corpus between the threshold
adjusted SVMs and baseline SVMs for each topic with respect to the T10SU evaluation
measure. Table 4 presents the macro-average results for precision, recall, Fbeta, and
T11SU for the TREC 2001 corpus. The K-NN result in Table 4 corresponds to a
k-nearest neighbor approach [2]. The IR result is achieved using traditional information
retrieval filters [1]. The RBF SVM result in Table 4 was achieved using SVMs and
radial basis kernels [13].



[Figure: TREC2001 - T10SU. Plotted series: SVMThresh T10SU - T10SU, and # Train.]

Fig. 4. The difference in T10SU performance for the TREC2001 corpus.



Table 4. A results comparison for the TREC2001 corpus. 



Approach                T10SU    Fbeta    Precision    Recall

Asymmetric SVM [11]     0.41     0.60     0.75         0.45
CC Thresholded SVMs     0.40     0.56     0.64         0.50
K-NN [2]                0.32     0.49     0.63         0.36
Linear SVM              0.31     0.50     0.75         0.31
IR [1]                  0.31     0.51     0.57         0.41
RBF SVM [13]            0.28     0.46     0.55         0.44



Adjusting the threshold of the SVM for the TREC2001 topics has boosted recall
and therefore led to over 20% improvement in terms of T11SU performance over
baseline SVMs (linear SVM), while not affecting precision. This performance is
comparable with the best performer for this text classification task, which was
prepared by Lewis [11]. Lewis’s submission was generated using asymmetric SVMs. The
following observations can be made when we compare evaluation measures for our
threshold adjusted experiment and Lewis’s asymmetric run. First, due to the
expensive cross-fold validation required for determining the asymmetric costs of the
SVM learning, Lewis’s asymmetric SVMs took two orders of magnitude more time to
train than our threshold adjusted SVMs (i.e., 500 hours for Lewis’s experiment
versus 5 hours for our experiment). Second, Lewis’s experiment with asymmetric
SVMs provides 14% better precision than our threshold adjusted run, while our
threshold adjusted run provides 11% better recall than Lewis’s run. This would seem
to suggest that asymmetric SVMs and threshold adjustment address two independent
aspects of the problem, which, if combined, could boost performance even further.

5 Conclusions 

We have presented a novel SVM threshold adjusting algorithm. It uses cross valida- 
tion to automatically determine the optimal parameters for the beta-gamma algorithm, 
which are subsequently used to relax the threshold of the class model. The proposed 
approach boosts the recall performance of baseline SVMs for text classification, 
while not adversely affecting precision. The gain in performance for examined TREC 
corpora is over 20% for standard information retrieval measures when compared to 
baseline SVMs. The extra cost of performing this threshold adjustment is small, in 
that it is a one-dimensional optimization problem. Adjusting the threshold of SVMs is 
just one technique for boosting the performance of SVMs. Combining our threshold 
adjustment algorithm with other techniques, such as asymmetric cost-based learning 
of SVMs, should lead to even better performance. This is part of ongoing work. A 
more detailed comparison between the proposed approach and other thresholding 
approaches that have been or can be applied to the task of threshold adjustment for SVMs
is currently being carried out. In addition, since the proposed thresholding approach is 
independent of the learnt model, using it in conjunction with other types of models 
will also form an interesting aspect of future work. 

Abstract. The Data Oriented Parsing (DOP) model currently achieves
state-of-the-art parsing on benchmark corpora. However, existing DOP
parameter estimation methods are known to be biased, and ad hoc
adjustments are needed in order to reduce the effects of these biases on
performance. In contrast with earlier work, in this paper we show that
the DOP parameters constitute a hierarchically structured space of
correlated events (rather than a set of disjoint events). The correlations
between the different parameters can be expressed by an asymmetric
relation called “backoff”. Subsequently, we present a novel recursive
estimation algorithm that exploits this hierarchical structure for parameter
estimation through discounting and backoff. Finally, we report on
experiments showing error reductions of up to 15% in comparison to earlier
estimation methods.



1 Introduction 

The Data Oriented Parsing (DOP) model currently exhibits state-of-the-art
performance on benchmark corpora [1]. A DOP model is trained on a treebank¹ by
extracting all subtrees of the treebank trees and employing them as the basic
rewrite events (or productions) of a formal grammar. The problem of how to
estimate the probabilities of the subtrees from the treebank turns out not to be as
straightforward as originally thought. So far, there exist three suggestions for
parameter estimation [2,3,1]. As shown in [3,4] and in the present paper, all 
three estimation procedures turn out to be biased in an unintuitive manner. 
Therefore, the problem of how to estimate the DOP parameters in a productive, 
yet computationally reasonable manner, remains unsolved. 

Parameter estimation for DOP is complex due to two unique aspects of the 
model: (1) a parse-tree in DOP is often generated through multiple, different 
rewrite derivations, and (2) the model consists of treebank subtrees of arbitrary 
size. These two aspects distinguish DOP from other existing models that are
predominantly based on the paradigm of History-Based Stochastic Grammars (HBSG)

* We thank Rens Bod, Detlef Prescher and the reviewers for their comments. 

¹ A treebank is a sample of parse-sentence pairs drawn from a domain of language
use; the parses are the correct syntactic structures as perceived by humans.

Fig. 2. Two different derivations of the same parse



[5]. An HBSG generates every parse-tree through a unique derivation involving 
rewrite productions that can be considered, to a large extent, disjoint events. This
allows for simpler estimation procedures and more efficient parsing algorithms 
than is possible for current DOP models. 

This paper addresses the DOP parameter estimation from a different angle 
than preceding work. Crucially, we observe that the space of subtrees of a DOP 
model does not merely constitute a set of disjoint events, but that it constitutes 
a hierarchical space. This space is structured by a partial order between the 
different derivations of the same subtree; the more independence assumptions 
a derivation involves, the lower it is in this hierarchy, just like different n-gram 
orders of the same word string. This partial order between a subtree and its 
derivations is characterized by the relation of “backoff”, defined in the sequel.
Subsequently, a DOP model can be viewed as an interpolation of different orders
of derivations: a subtree, the derivations of that subtree obtained by one, two, . . .
independence assumptions between smaller subtrees. Based on this observation
we suggest combining the different derivations through backoff, rather than
interpolation. This view leads to a simple, yet powerful recursive estimation
procedure. The new DOP model, Backoff DOP, leads to improved parsing results.
We report on experiments that show a 10-15% error reduction on a treebank on 
which the original DOP model already achieves excellent results. 



2 The DOP Model 

Like other treebank models, DOP extracts a finite set of rewrite productions, 
called subtrees, from the training treebank together with probability estimates. 
A connected subgraph of a treebank tree t is called a subtree iff it consists of 
one or more context-free productions² from t. Following [2], the set of rewrite
productions of DOP consists of all the subtrees of the treebank trees. Figure 3 
exemplifies the set of subtrees extracted from the treebank of Figure 1. 

² Note that a non-leaf node labeled p in tree t dominating a sequence of nodes
labeled $c_1, \ldots, c_n$ consists of a graph that represents the context-free
production: $p \rightarrow c_1 \cdots c_n$.

Fig. 3. The subtrees of the treebank in Figure 1



DOP employs the set of subtrees in a Stochastic Tree-Substitution Grammar 
(STSG): a TSG is a rewrite system similar to Context-Free Grammars (CFGs), 
with the difference that TSG productions are subtrees of arbitrary depth³. 

A TSG derivation proceeds by combining subtrees using the substitution
operation ∘, starting from the start symbol S of the TSG. In contrast with CFG
derivations, multiple TSG derivations may generate the same parse⁴. For
example, the parse in Figure 1 can be derived in at least two different ways, as
shown in Figure 2. In this sense, the DOP model deviates from other
contemporary models, e.g. [8,9], that belong to the so-called History-Based Stochastic
Grammar (HBSG) family [5].

A Stochastic TSG (STSG) is a TSG extended with a probability mass
function $P$ over the set of subtrees: the probability of subtree $t$, which has root
label $R_t$, is given by $P(t \mid R_t)$, i.e. for every nonterminal $A$:
$\sum_{\{t \mid R_t = A\}} P(t \mid A) = 1$.
Given a probability function $P$, the probability of a derivation
$D = S \circ t_1 \circ \cdots \circ t_n$ is defined by
$P(D \mid S) = \prod_{i=1}^{n} P(t_i \mid R_{t_i})$. The probability of a parse is defined
by the sum of the probabilities of all derivations in the STSG that generate that
parse.
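
The two definitions above can be illustrated with a minimal sketch. The subtree names, their probabilities, and the claim that the two listed derivations yield the same parse are all hypothetical values chosen for illustration; they are not taken from the paper's figures, and the toy table is not normalized per root label.

```python
from math import prod

# Hypothetical subtree probabilities P(t | R_t); illustrative values only.
P = {"a": 0.5, "b": 0.2, "c": 0.4, "d": 0.1}


def derivation_probability(derivation, probs):
    """P(D | S): product of P(t_i | R_{t_i}) over the subtrees used in D."""
    return prod(probs[t] for t in derivation)


def parse_probability(derivations, probs):
    """Sum of the probabilities of all derivations that generate one parse."""
    return sum(derivation_probability(d, probs) for d in derivations)


# Two distinct derivations assumed (for illustration) to yield the same parse.
derivations_of_parse = [("a", "b"), ("c", "d")]
print(parse_probability(derivations_of_parse, P))
```

The sketch makes the asymmetry visible: scoring one derivation is a single product, whereas scoring a parse requires enumerating every derivation that generates it.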

When parsing an input sentence $U$ under a DOP model, the preferred parse $T$
is the Most Probable Parse (MPP) for that sentence: $\arg\max_T P(T \mid U)$. However,
the problem of computing the MPP is known to be intractable [10]. In contrast,
the calculation of the Most Probable Derivation (MPD) $D$ for the input sentence
$U$, i.e., $\arg\max_D P(D \mid U)$, can be done in time cubic in sentence length.

In this paper we address another difficulty that arises from the property of the 
multiple derivations per parse with regard to the DOP model: how to estimate 
the model parameters (i.e. subtree probabilities) from a treebank. 



³ The depth of a tree is the number of edges along the longest path from the root to 
a leaf node. 

⁴ Note the difference between parses and subtrees: the former are generated, complex 
events while the latter are atomic rewrite events. 
3 Existing DOP Estimators 

All three existing DOP estimators are biased, either by giving too much proba- 
bility mass to large/small subtrees or by overfitting the training data. 

(1) Subtree Relative Frequency: The first instantiation of a DOP model is due
to [11] and is referred to as DOPrf. In this model, the probability estimates of
subtrees extracted from a treebank are given by a relative frequency estimator.
Let $f(t)$ represent the number of times $t$ occurred in the bag of subtrees
extracted from the treebank. Then the probability of $t$ in DOPrf is estimated as:
$P_{rf}(t \mid R_t) = \frac{f(t)}{\sum_{\{t' \mid R_{t'} = R_t\}} f(t')}$.
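
As a minimal sketch of a relative frequency estimator of this kind, the function below normalizes each subtree count by the total count of subtrees sharing its root label. The representation of a subtree occurrence as a `(root_label, subtree_id)` pair and the toy bag are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def relative_frequency(subtree_bag):
    """Estimate P_rf(t | R_t) from a bag of (root_label, subtree_id) occurrences."""
    counts = Counter(subtree_bag)
    root_totals = defaultdict(int)
    for (root, _), c in counts.items():
        root_totals[root] += c
    # Each count is divided by the total count of subtrees with the same root.
    return {key: c / root_totals[key[0]] for key, c in counts.items()}


# Hypothetical bag of subtree occurrences (one entry per occurrence).
bag = [("S", "s1"), ("S", "s1"), ("S", "s2"), ("NP", "n1")]
p_rf = relative_frequency(bag)
print(p_rf)
```

By construction the estimates for each root label sum to one, which is the STSG normalization requirement.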

Using heuristics to limit the unwanted biases, the model achieved 89.7% in 
labelled recall and precision on the Wall Street Journal treebank [1]. Despite good 
performance, the DOPrf estimator has been shown to be biased and inconsistent 
[4]. As argued in [3], DOPrf overestimates the probability of large subtrees. 
Furthermore, DOPrf’s good performance can be attributed to limitations on the 
set of subtrees extracted from the treebank (e.g. subtree depth upper bounds). 
These constraints reduce the model’s bias, leading to improved performance. 

(2) Bonnema’s Estimator: In [3], an alternative estimator for DOP is proposed.
It assumes that every treebank parse represents a uniform distribution over all
possible derivations that generated that parse in the model. Thus, the probability
of a subtree $t$ is estimated by taking the relative frequency of $t$ along with the
fraction of derivations of the treebank parses in which $t$ participates. This leads
to the following estimate: $P(t \mid R_t) = 2^{-N(t)} P_{rf}(t \mid R_t)$, where $N(t)$ is the number
of non-root nonterminal nodes of subtree $t$ and $P_{rf}$ is the original DOPrf
relative frequency estimator. The estimator defines a new DOP model which we
refer to as the DOPBon model. Next we show that the DOPBon estimator is biased
towards smaller subtrees.
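
A small numeric sketch can make the direction of this bias concrete. It assumes the estimate has the form $2^{-N(t)} P_{rf}(t \mid R_t)$, with $N(t)$ the number of non-root nonterminal nodes; the two subtrees and their relative frequency are hypothetical.

```python
def bonnema_estimate(p_rf: float, n_nonroot_nonterminals: int) -> float:
    """Scale a relative-frequency estimate by 2^(-N(t)) (assumed form)."""
    return 2.0 ** (-n_nonroot_nonterminals) * p_rf


# Two hypothetical subtrees with equal relative frequency but different sizes.
small = bonnema_estimate(0.25, 0)  # no non-root nonterminal nodes
large = bonnema_estimate(0.25, 3)  # three non-root nonterminal nodes
print(small, large)
```

Under this assumed form, each additional non-root nonterminal node halves the estimate, so larger subtrees lose mass exponentially in their size.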




Fig. 4. Example illustrating DOPBon’s bias towards small subtree fragments. Given 
the training treebank (a-c), the correct parse for abcdefgh should be (a). The DOPBon model 
chooses (d) instead 



Consider the treebank in Figure 4(a-c) with 17 subtrees. According to the
treebank, the correct analysis for string abcdefgh should be the one in
Figure 4(a). DOPBon, however, prefers the parse in Figure 4(d). In other words,
a derivation that was actually seen in the treebank (i.e. the parse tree yielding
abcdefgh in Figure 4(a)) becomes less likely than a newly constructed parse
involving the subtree in Figure 4(b)!
(3) Maximum-Likelihood: One might say that the DOPrf estimator is biased
because it is not a Maximum-Likelihood (ML) estimator. This is in fact the
approach taken in [12], where the Inside-Outside algorithm is used for estimation
of DOP model parameters from a treebank under the assumption that the model
has a hidden element (the derivations that generated the parses of the treebank).
However, as [3] pointed out, ML for DOP results in a model that overfits the
treebank. We exemplify this next.

Consider a treebank with trees $T_1$, $T_2$, both having the same root label $X$.
The ML probability assignment to the subtrees extracted from this treebank is
given as follows: for $T_1$ and $T_2$, $P(T_1 \mid X) = P(T_2 \mid X) = 1/2$; for all other $X$-rooted
subtrees $t$, $P(t \mid X) = 0$. In other words, probability zero is given to all parses not
present in the treebank, resulting in a model that overfits the data and has no
generalization power.



4 A New Estimator for DOP 

In this section we develop a completely different approach to parameter
estimation for DOP than earlier work. Consider the common situation where a
subtree⁵ $t$ is equal to a tree generated by a derivation $t_1 \circ \cdots \circ t_n$ involving
multiple subtrees $t_1, \ldots, t_n$. For example, subtree s17 (Figure 3) can be constructed
by different derivations such as (s16 ∘ s2), (s14 ∘ s1) and (s15 ∘ s1 ∘ s3). We
will refer to subtrees that can be constructed from derivations involving other
subtrees with the term complex subtrees.

For every complex subtree $t$, we restrict⁶ our attention only to the derivations
involving pairs of subtrees; in other words, we focus on subtrees $t$ such that
there exist subtrees $t_1$ and $t_2$ with $t = (t_1 \circ t_2)$. In DOP, the probability
of $t$ is given by $P(t \mid R_t)$. In contrast, the derivation probability is defined by
$P(t_1 \circ t_2 \mid R_{t_1}) = P(t_1 \mid R_{t_1}) P(t_2 \mid R_{t_2})$. However, according to the chain rule (note
that $R_t = R_{t_1}$):

$P(t_1 \circ t_2 \mid R_{t_1}) = P(t_1 \mid R_{t_1}) P(t_2 \mid t_1, R_{t_1})$

Therefore, the derivation $t_1 \circ t_2$ embodies an independence assumption realized
by the approximation⁷: $P(t_2 \mid t_1) \approx P(t_2 \mid R_{t_2})$. This approximation involves a
so-called backoff, i.e. a weakening of the conditioning context from $P(t_2 \mid t_1)$ to
$P(t_2 \mid R_{t_2})$. Hence, we will say that the derivation $t_1 \circ t_2$ constitutes a backoff of
subtree $t$ and we will write $(t >_{bfk} t_1 \circ t_2)$ to express this fact.

⁵ According to the definitions in section 2, the term “subtree” is reserved for the
tree-structures that DOP extracts from the treebank.

⁶ Because DOP takes all subtrees of the treebank, if complex subtree $t$ has a derivation
$t_1 \circ t_2 \circ \cdots \circ t_n$, then the tree resulting from $t_1 \circ t_2$ is a complex subtree also. For
example, in Figure 3, s17 can be derived through (s15 ∘ s1 ∘ s3); s15 ∘ s1 generates
subtree s13. Hence, derivations of $t$ that involve more than two subtrees can be
separated into (sub)derivations that involve pairs of subtrees, each leading to a
complex subtree. Therefore, for any complex subtree $t$, we may restrict our attention
to derivations involving only pairs of subtrees, i.e., $t = t_1 \circ t_2$.

⁷ Note that $R_{t_2}$ is part of $t_1$ (the label of the substitution site).
The backoff relation $>_{bfk}$ between a subtree and a pair of other subtrees
allows for a partial order between the derivations of the subtrees extracted from
a treebank. A graphical representation of this partial order is a directed acyclic
graph which consists of a node for each pair of subtrees that constitute a
derivation of another complex subtree. A directed edge points from a subtree $t_i$
in a node⁸ to another node containing a pair of subtrees $(t_j, t_k)$ iff $t_i >_{bfk} t_j \circ t_k$.

We refer to this graph as the backoff graph. A portion of the backoff graph for the
subtrees of Figure 3 is shown in Figure 5 (where s0 stands for a subtree consisting
of a single node labeled S, the start symbol).




Fig. 5. A portion of the backoff graph for the subtrees in Figure 3 



We distinguish two sets of subtrees: initial and atomic. Initial subtrees do 
not participate in a backoff derivation of any other subtree. In Figure 3, subtree 
s17 is the only initial subtree. Atomic subtrees are subtrees for which there are 
no backoffs. In Figure 3, these are subtrees of depth one (double circled in the 
backoff graph). 
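
The distinction between initial and atomic subtrees can be sketched on a toy fragment of a backoff relation. The dictionary below encodes only the derivations of s17 and s13 that the text mentions; the rest of Figure 3 is not reproduced, so treating the remaining subtrees as atomic is an assumption of this illustration, and the function names are hypothetical.

```python
# Toy backoff relation: complex subtree -> pairs (t1, t2) with t >bfk t1 o t2.
# Only the derivations named in the text are encoded; everything else is
# treated as atomic purely for the purpose of this illustration.
backoffs = {
    "s17": [("s16", "s2"), ("s14", "s1"), ("s13", "s3")],
    "s13": [("s15", "s1")],
}


def all_subtrees(rel):
    """Every subtree mentioned on either side of the backoff relation."""
    nodes = set(rel)
    for pairs in rel.values():
        for t1, t2 in pairs:
            nodes.update((t1, t2))
    return nodes


def initial_subtrees(rel):
    """Subtrees that do not participate in a backoff derivation of another subtree."""
    used = {t for pairs in rel.values() for pair in pairs for t in pair}
    return all_subtrees(rel) - used


def atomic_subtrees(rel):
    """Subtrees for which there are no backoffs."""
    return {t for t in all_subtrees(rel) if not rel.get(t)}


print(sorted(initial_subtrees(backoffs)))
print(sorted(atomic_subtrees(backoffs)))
```

In this fragment s17 is the only initial subtree, which agrees with the statement about Figure 3; the set of atomic subtrees is an artifact of the truncated toy relation.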

In the DOP model (under any estimation procedure discussed in section 3), 
the probability of a parse-tree is defined as the sum of the probabilities of all 
derivations that generate this parse-tree. This means that DOP interpolates, 
linearly and with uniform weights, derivations involving subtrees from different 
levels of the backoff graph; this is similar to the way Hidden Markov Models 
interpolate different Markov orders over, e.g. words, for calculating sentence 
probability. Hence, we will refer to the different levels of subtrees in the backoff 
graph as the Markov orders. 

Backoff DOP: Crucially, the partial order over the subtrees, embodied in the 
backoff graph, can be exploited for turning DOP into a “backed-off model” as 
follows. A subtree is generated by a sequence of derivations ordered by the backoff 
relation. This is in sharp contrast with existing DOP models that consider the 
different derivations leading to the same subtree as a set of disjoint events. 
Next, after we review smoothing and Katz Backoff, we present the estimation 
procedure that accompanies this new realization of DOP as a recursive backoff 
over the different Markov orders. 



⁸ In a pair $(t_h, t_i)$ or $(t_i, t_h)$ that constitutes a node.






Estimation vs. Smoothing: It is common to smooth a probability distribution
$P(t \mid X, Y)$ by a backoff distribution, e.g. $P(t \mid X)$. The smoothing of $P(t \mid X, Y)$
aims at dealing with the problem of sparse data (whenever the probability
$P(t \mid X, Y)$ is zero). $P(t \mid X)$ can be used as an approximation of $P(t \mid X, Y)$ under
the assumption that $t$ and $Y$ are independent. Smoothing, then, aims at
enlarging the space of non-zero events in the distribution $P(t \mid X, Y)$. Hence, the goal of
smoothing differs from our goal. While smoothing aims at filling the zero gaps in
a distribution, our goal is to estimate the distribution (prior to smoothing it).
Despite these differences, we employ a backoff method for parameter estimation
by redistributing probability mass among DOP model subtrees.

Katz Backoff: Katz backoff [6,7] is a smoothing technique based on the discounting method
of Good-Turing (GT) [13,7]. Given a higher order distribution $P(t \mid X, Y)$, Katz
backoff employs the GT formula for discounting from this distribution, leading
to $P_{GT}(t \mid X, Y)$. The probability mass that was discounted, $1 - \sum_t P_{GT}(t \mid X, Y)$,
is distributed over the lower order distribution $P(t \mid X)$.
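
The discounting step can be sketched with the textbook Good-Turing adjusted count $r^* = (r + 1) N_{r+1} / N_r$, where $N_r$ is the number of events seen exactly $r$ times. This is a simplified illustration, not the procedure used in the experiments: falling back to the raw count when $N_{r+1} = 0$ is an assumption made here to keep the toy example well defined, and the event counts are invented.

```python
from collections import Counter


def good_turing_counts(counts):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    Assumption of this sketch: keep the raw count r when N_{r+1} is zero.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for event, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[event] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    return adjusted


def discount(counts):
    """Return discounted probabilities and the reserved (discounted) mass."""
    total = sum(counts.values())
    p_gt = {e: c / total for e, c in good_turing_counts(counts).items()}
    reserved = 1.0 - sum(p_gt.values())
    return p_gt, reserved


# Invented counts: one event seen 3 times, two seen twice, six seen once.
counts = Counter("aaabbccdefghi")
p_gt, reserved = discount(counts)
print(reserved)
```

The reserved mass is what a Katz-style scheme would hand to the lower order distribution; events seen only once contribute most of it.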

Estimation by Backoff: We assume initial probability estimates $P_f$ based on
frequency counts, e.g. as in DOPrf or DOPBon. The present backoff estimation
procedure operates iteratively, top-down over the backoff graph, starting with the
initial subtrees and moving down towards the atomic subtrees. In essence, this procedure
transfers probability mass stepwise from complex subtrees to their backoffs.

Let $P^c$ represent the current probability estimate resulting from $i$ previous
steps of re-estimation (initially, for $i = 0$, $P^0 := P_f$). After $i$ steps, the edges
of the backoff graph lead to the current layer of nodes. For every $t$, a subtree
in a node from the current layer in the backoff graph, an edge $e$ outgoing from
$t$ stands for the relation $(t >_{bfk} t_1 \circ t_2)$, where $(t_1, t_2)$ is the node at the other
end of edge $e$. We know that $P^c(t_2 \mid t_1)$ is backed off to $P^c(t_2 \mid R_{t_2})$ since
$P^c(t \mid R_t) = P^c(t_1 \mid R_{t_1}) P^c(t_2 \mid t_1)$ and
$P^c(t_1 \circ t_2) = P^c(t_1 \mid R_{t_1}) P^c(t_2 \mid R_{t_2})$. We adapt the Katz
method to estimate the Backoff DOP probability $P_{bo}$ as follows:

$$P_{bo}(t_2 \mid t_1) = \begin{cases} P_{GT}(t_2 \mid t_1) + \alpha(t_1)\, P_f(t_2 \mid R_{t_2}) & [P^c(t_2 \mid t_1) > 0] \\ \alpha(t_1)\, P_f(t_2 \mid R_{t_2}) & \text{otherwise} \end{cases}$$

where $\alpha(t_1)$ is a normalization factor that guarantees that the sum of the
probabilities of subtrees with the same root label is one. Simple arithmetic leads to
the following formula: $\alpha(t_1) = 1 - \sum_{t_2 : f(t_1, t_2) > 0} P_{GT}(t_2 \mid t_1)$. Using the above
estimate of $P_{bo}(t_2 \mid t_1)$, the other backoff estimates are calculated as follows:

$$P_{bo}(t \mid R_t) := P_f(t_1 \mid R_{t_1})\, P_{GT}(t_2 \mid t_1) \qquad P_{bo}(t_1 \mid R_{t_1}) := (1 + \alpha(t_1))\, P_f(t_1 \mid R_{t_1})$$

Before the next step $i + 1$ takes place over the next layer in the backoff graph,
the current probabilities are updated as follows: $P^{i+1}(t_1 \mid R_{t_1}) := P_{bo}(t_1 \mid R_{t_1})$.

Note that $P_{bo}$ is a proper distribution in the sense that for all nonterminals
$A$: $\sum_t P(t \mid A) = 1$. This is guaranteed by the redistribution of the reserved
probability mass at every step of the procedure over the layers of the backoff
graph. Furthermore, we note that the present method is not a smoothing method
since it applies Katz Backoff for redistributing probability mass only among
subtrees that did occur in the treebank. The present method does not address
probability estimation for unknown/unseen events.
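
One step of this procedure for a single subtree $t_1$ can be sketched as follows. The sketch follows one reading of the update formulas: $\alpha(t_1)$ is the mass left after summing the Good-Turing discounted conditionals over the observed $t_2$, each complex subtree $t = t_1 \circ t_2$ receives $P_f(t_1 \mid R_{t_1}) P_{GT}(t_2 \mid t_1)$, and $t_1$ itself is rescaled by $(1 + \alpha(t_1))$. The function name, argument layout, and numbers are hypothetical.

```python
def backoff_step(p_f_t1, p_gt_given_t1):
    """One backoff step for a fixed subtree t1 (illustrative sketch).

    p_f_t1        -- frequency-based estimate P_f(t1 | R_t1)
    p_gt_given_t1 -- dict t2 -> discounted conditional P_GT(t2 | t1),
                     one entry per t2 observed together with t1
    """
    # Reserved mass: what discounting removed from the observed combinations.
    alpha = 1.0 - sum(p_gt_given_t1.values())
    # New estimate for each complex subtree t = t1 o t2.
    p_bo_complex = {t2: p_f_t1 * p for t2, p in p_gt_given_t1.items()}
    # The backoff subtree t1 gains the reserved mass.
    p_bo_t1 = (1.0 + alpha) * p_f_t1
    return alpha, p_bo_complex, p_bo_t1


# Hypothetical numbers: t1 was seen combined with two different t2.
alpha, p_bo_complex, p_bo_t1 = backoff_step(0.2, {"t2a": 0.5, "t2b": 0.3})
print(alpha, p_bo_complex, p_bo_t1)
```

In a full implementation this step would be repeated layer by layer over the backoff graph, with the updated value for $t_1$ serving as the current estimate in the next layer.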

Current Implementation: The number of subtrees extracted from a treebank
is extremely large. In this paper, we choose to apply the Katz backoff only to
$t >_{bfk} t_1 \circ t_2$ iff $t_2$ is a lexical subtree, i.e., $t_2 = X \to w$ where $X$ is a Part of Speech
(PoS) tag and $w$ a word. Our choice has to do with the importance of lexicalized
subtrees and the overestimation that accompanies their relative frequency. All
experiments reported here pertain to applying the backoff estimation procedure
to this limited set of subtrees (while the probabilities of all other subtrees are
left untouched).



5 Empirical Results 

OVIS Corpus and Evaluation Metrics: The experiments were carried out on the
OVIS corpus, collected from a Dutch-language, speech-based dialogue system that provides
railway information to human users over ordinary telephone lines [14]. The
corpus contains 10,049 syntactically and semantically annotated utterances which
are answers given by users to the system’s questions (e.g. “From where to where
do you want to travel?”). Utterances are annotated by a phrase-structure scheme
with syntactic+semantic labels.

The corpus was randomly split into two sets: i) a training set with 9,049 
trees; ii) a test set with 1,000 trees. The experiments were carried out using the 
same train/test split, unless stated otherwise. We report results for sentences 
that are at least two words long (as one-word sentences are easy). Without one-word 
sentences, the average sentence length is 4.6 words/sentence. 

Three accuracy measures were employed: exact match and recall/precision 
(F-score) of labeled bracketing [15]. Furthermore, we compare models using the 
error reduction ratio: the percentage-point improvement of model 1 
over model 2, normalized by the global error of model 2. 
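
The error reduction ratio can be written as a one-line function. The sketch assumes accuracies are given in percent and that the global error of a model is 100 minus its accuracy; the example values are hypothetical and are not results from the paper.

```python
def error_reduction(acc_model1: float, acc_model2: float) -> float:
    """Percentage of model 2's error removed by model 1 (accuracies in percent)."""
    return 100.0 * (acc_model1 - acc_model2) / (100.0 - acc_model2)


# Hypothetical accuracies: model 1 at 90%, model 2 at 80%.
print(error_reduction(90.0, 80.0))
```

A ten-point improvement over a model with 20% error therefore counts as removing half of that model's error.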

Subtree space was reduced by means of four upper bounds on their shape:
1) depth (d), 2) number of lexical items (l), 3) number of substitution sites
(n) and 4) number of consecutive lexical items (L). Most Probable Derivation
(MPD) and Most Probable Parse (MPP) were used as the maximization
entities to select the preferred parse⁹.

Naming convention: We tested the new estimator under two different counting
strategies: DOPrf and DOPBon. The following naming convention was used:
DOPrf (as in [2]), DOPBon (as in [3]), BF-DOPrf (backoff estimator
applied to DOPrf frequencies), and BF-DOPBon (backoff estimator applied to
DOPBon frequencies).

Accuracy vs. Depth Upper Bound: Figure 6 shows exact match results as a 
function of subtree depth upper bound. The subtrees were restricted to at most 2 

⁹ A complete specification of the algorithms for extracting the MPD and MPP can be 
found in [16,17]. 
Fig. 6. Exact match as a function of subtree depth upper bound (l2n4, MPD)

Fig. 7. Probability mass discounted from subtrees as a function of depth upper bound (l2n4)



words and 4 substitution nodes (l2n4). This yields small subtree spaces: for depth
7, this corresponds to 172,050 subtrees. For all depth upper bounds, BF-DOPrf
achieved the best results, followed by DOPrf, DOPBon and BF-DOPBon.
BF-DOPrf improved on DOPrf by 1.72 percentage points at depth 6, an increase of
2.02%. This corresponds to an error reduction of 11.4%. When compared to
DOPBon, error reduction rose to about 15%. F-score results followed the same
pattern. At depth 6, BF-DOPrf reached 95.33%; DOPrf, 94.73%; DOPBon,
94.5% and BF-DOPBon, 94.33%. Error reduction of BF-DOPrf with respect
to DOPrf reached 11.3% and with respect to DOPBon, 15.7%.

The good performance of BF-DOPrf over DOPrf may be explained by its
reduced bias towards large subtrees. BF-DOPBon’s poor performance, on the
other hand, is a result of increasing even further DOPBon’s bias towards smaller
subtrees. The lesson here is: if one has to choose between biased estimators,
choose the one favoring larger subtrees; they are able to capture more
linguistically relevant dependencies.

Probability Mass Transfer: Figure 7 shows discounted probability mass as a
function of subtree depth upper bound. The probability mass discounted from
2-word subtrees is bigger than the mass discounted from 1-word subtrees¹⁰. This
happens because the number of hapax legomena (subtrees that occur just once)
tends to increase for higher d and l upper bounds, since more large subtrees with
rare word combinations are allowed into the distribution. More hapax legomena
result in higher discounting rates according to the Good-Turing method. Thus,
the probability mass discounted from n-word subtrees is, in general, bigger than
the mass discounted from (n-1)-word subtrees. Consequently, the magnitude of the
probability transfer across Markov orders gradually decreases as the recursive
estimation procedure approaches atomic subtrees. The property of decreasing
discounts avoids the pitfall of overestimating small subtrees (cf. DOPBon) and,
at the same time, reduces the overestimation of large subtrees (cf. DOPrf).

¹⁰ n-word subtrees are subtrees having exactly n words in their leaf nodes.

Most Probable Parse and Derivation: The next set of experiments used
distributions obtained with parameters l7n2L3. These settings allow for testing the
models quickly, since they generate relatively small subtree spaces (133,308
subtrees for depth 7). Longer cascade effects can also be observed, since the models
can back off from 7-word subtrees to 0-word ones. Figures 8 and 9 show exact
match results as a function of depth upper bound and maximization entity. Note
that they follow the same pattern observed in the results with l2n4. Unlike those,
however, maximum accuracy is achieved here at depth 3, not 6.

Fig. 8. Exact match as a function of depth upper bound (l7n2L3, MPD)

Fig. 9. Exact match as a function of depth upper bound (l7n2L3, MPP, Monte Carlo sample size: 5,000)

MPP has better performance than MPD. This shift is stronger for DOPrf 
and BF-DOPBon. One possible explanation is that MPP allows for recovering 
part of the joint dependencies lost by the independence assumption underlying 
MPD. Moreover, contrary to MPP, MPD assumes that the probability of a parse 
is concentrated in a single derivation, which might lead to wrong results. 

Consistency Across Splits: To test whether the results above are due to some
random property of the train/test split, experiments were carried out with fixed
parameters l7n2L3d3 and varying train/test splits. The number of trees was kept
constant: 9,049 for training; 1,000 for testing. Depth 3 was chosen because most
models achieved their best results with this setting, which might indicate
fluctuation. Four experiments were carried out. Experiment 1 (Exp1) refers to the
split used so far. Experiments 2, 3 and 4 (Exp2, Exp3, Exp4) refer to different
splits obtained through random drawings from the OVIS corpus. Note that
different splits yield distinct models with different generative powers. Therefore,
performance can vary greatly.
Fig. 10. Exact match for fixed depth 3 as a function of training/test set splits (l7n2d3L3, MPD)

Figure 10 shows that the pattern previously observed, with BF-DOPrf achieving
the best results and BF-DOPBon the worst, is conserved across the splits. The
improvement, although persistent, is not statistically significant. BF-DOPrf’s
performance mean of 87.10% is not significantly different from DOPrf’s mean of
86.54%, with 95% confidence according to the t-test (interval: ±0.3097%).



Exp3 was the only one that reached statistical significance. It is important 
to emphasize that the wide range of the confidence interval is due to the small 
sample size. Definitive conclusions can only be drawn once a larger number of 
experiments is carried out. Moreover, these results refer to a single ‘cut’ in the 
parameter space and it is possible that different parameter combinations will 
result in significant differences. In any case, these results do not contradict the 
relative ranking between BF-DOPrf and DOPrf. 



6 Conclusions 

The main point of this paper is that the DOP parameters constitute a hierarchi- 
cally structured space of highly correlated events. We presented a novel estimator 
for the DOP model based on this observation by expressing the correlations in 
terms of backoff. We provided empirical evidence for the improved performance 
of this estimator over existing estimators. 

We think that the hierarchical structuring of the space of DOP parameters 
can be exploited within a Maximum-Likelihood estimation procedure. The space 
structure can be seen to express some parameters as functions of other parame- 
ters. 

Future work will address (1) formal aspects of the new estimator (bias and 
inconsistency questions), (2) a Maximum-Likelihood variant for DOP that in- 
corporates the observations discussed in this paper, and (3) further experiments 
on larger treebanks, and less constrained DOP models. 
Abstract. The usual numerical learning methods, that are primarily 
concerned with finding a good numerical fit to the data, often make 
predictions that do not correspond to the qualitative mechanisms in 
the domain of modelling or a domain expert’s intuition. Consistency 
of numerical predictions with a given qualitative model is helpful when 
a numerical model is used for explanation of phenomena in the mod- 
elled domain, but can also considerably improve numerical accuracy. In 
this paper we present a novel approach to numerical machine learning 
called Qfilter. Qfilter is a numerical regression method that can take into 
account qualitative background knowledge to give qualitatively faithful 
numerical prediction. The results on a set of domains including popula- 
tion dynamics show considerable prediction accuracy improvements com- 
pared to the usual numerical learners. As qualitative domain knowledge 
is often available in practice, Qfilter’s ability to exploit such knowledge 
should be beneficial in many applications. 



1 Introduction 

1.1 Qualitative Problems of Numerical Learning 

Methods of numerical machine learning, such as regression tree learning and
locally weighted regression (LWR), often make predictions that a knowledgeable
user finds obviously incorrect. A domain expert finds such errors incorrect not so 
much in numerical, but in qualitative terms. Often there are a priori known qual- 
itative constraints in the domain of application, and for numerical predictions 
to make sense, the predictions should be consistent with such constraints. 

For example, consider a container filled with water. Let there be an open
drain at the bottom of the container, and water is draining out. Suppose we
want to make predictions about the amount of water at various times. Although
exact numerical predictions of the amount may be hard, obviously these
predictions will have to satisfy some qualitative constraints, such as: (1) the amount
can never be negative, and (2) the amount of water in the container can never
be increasing. Suppose that we have examples of measurements of the amount of
water in time, obtained from past behavior of the draining process that started
at different initial amounts. We then use standard methods of numerical machine
learning to make predictions of the amount at future times, starting with some
new initial amount. Unfortunately, state-of-the-art numerical learning techniques
will typically produce predictions that do not completely respect the
above-mentioned qualitative constraints even when the learning data is noise-free. Šuc et
al. [1] give pertinent experimental results with the draining process, using M5
regression and model trees [2] and LWR [3] (implementation in Weka [4]).
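
The two constraints of the draining example are easy to check mechanically on a sequence of predicted amounts. The sketch below is only an illustration of what a qualitative error looks like; the function name and the prediction sequences are hypothetical and do not come from the experiments in [1].

```python
def violates_constraints(predictions):
    """Check predicted water amounts at successive time points.

    Returns (has_negative, has_increase): whether constraint (1) or
    constraint (2) from the draining example is violated.
    """
    has_negative = any(v < 0 for v in predictions)
    has_increase = any(b > a for a, b in zip(predictions, predictions[1:]))
    return has_negative, has_increase


# Hypothetical prediction sequences.
print(violates_constraints([10.0, 7.0, 5.0, 5.2, 3.0]))  # amount rises once
print(violates_constraints([4.0, 2.0, 0.5, -0.1]))       # amount goes negative
print(violates_constraints([5.0, 3.0, 1.0, 0.0]))        # respects both
```

A numerically small error such as the rise from 5.0 to 5.2 is exactly the kind of prediction a domain expert would reject on qualitative grounds.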

Such qualitative errors of numerical predictors are undesirable particularly 
because they make numerical results difficult to interpret. The underlying mech- 
anism in the domain is usually best explained in qualitative terms. However, this 
is obscured by qualitative errors in numerical predictions. 

1.2 Qfilter 

In this paper we introduce a numerical learning method, called Qfilter. Qfilter ac- 
cepts as input a set of numerical data points and a set of qualitative constraints, 
and performs numerical regression so that the predicted numerical values re- 
spect the given qualitative constraints. In a typical application of Qfilter, the 
qualitative constraints are provided by the domain expert as (qualitative) back- 
ground knowledge. Such constraints are typically part of domain knowledge. For 
example, in biological modelling the growth rate of a population is qualitatively 
proportional to the current size of the population and to the amount of food 
available to the population. Another possibility of applying Qfilter is within 
Q² learning, described below. In this context, qualitative constraints do not have 
to come from a domain expert. 

In Section 2 we define the type of qualitative constraints and qualitative 
trees accepted by Qfilter. In Section 3 we describe the Qfilter algorithm. Section 
4 presents experiments with Qfilter and a comparison with standard numerical 
prediction methods. Section 5 gives conclusions. 

1.3 Relation of Qfilter to Q² Learning

To rectify the qualitative problems of numerical learning, Šuc et al. [1] proposed
“qualitatively faithful quantitative learning”, called Q² learning for short. In
experiments with a complex industrial modelling problem, Q² learning not only
improved the predictions qualitatively, but also numerically. Numerical
predictions with Q² were considerably more accurate than those obtained with the
mentioned numerical learning methods.
Q² learning consists of two stages:

1. Induce a qualitative model from numerical examples. Program QUIN ([5,6]) 
that induces qualitative trees from numerical data can be used for this. 

2. Transform a qualitative model induced in stage 1 into a quantitative model 
(i.e. a numerical function) that fits well the given numerical data and is con- 
sistent with the qualitative constraints in the qualitative tree. This transfor- 
mation is called Q2Q transformation (qualitative-to-quantitative). 

The Q2Q transformation in [1] was based on piece-wise linear regression
where these linear functions were determined heuristically using LWR on a grid
of selected points. Although this method worked well in the experiments, it was
ad hoc in that there was no guarantee that the so obtained heuristic functions
would completely respect the qualitative constraints. The Qfilter approach
introduced in this paper is a better founded and better performing method for
Q2Q transformation.

Fig. 1. A qualitative tree that describes the qualitative relations between class Z and
attributes X and Y for the function $Z = X^2 - Y^2$. The rightmost leaf, applying when
attributes X and Y are positive, says that Z is strictly increasing in its dependence on
X and strictly decreasing in its dependence on Y.

2 Qualitative Trees for Knowledge Representation 

In this section we describe a formalism for the representation of qualitative back- 
ground knowledge. We represent qualitative knowledge in the form of so-called 
qualitative trees that are described below and proved to be useful and under- 
standable in several different applications [5,7,1]. In these applications qualita- 
tive trees were induced from numerical examples by program QUIN [5,6]. In this 
paper we assume that they are given, e.g. defined by a domain expert, and study 
the advantages in terms of prediction accuracy. 

Qualitative trees are similar to decision trees but model qualitative relations 
between the class and the attributes. As in decision trees, the internal nodes in a 
qualitative tree specify conditions that split the attribute space into subspaces. 
In a qualitative tree, however, each leaf specifies a region in the attribute space 
where some monotonicity constraints hold. These monotonicity constraints are 
represented by what we call qualitatively constrained functions (QCFs for short). 
A simple example of QCF is: $Y = M^{+}(X)$. This says that Y is a monotonically
increasing function of X. In general, QCFs can have more than one argument.
For example, $Z = M^{+,-}(X, Y)$ says that Z monotonically increases in X and
monotonically decreases in Y. We say that Z is positively related to X and
negatively related to Y. If both X and Y increase, then according to this constraint,
Z may increase, decrease or stay unchanged. In such a case, a QCF cannot make
an unambiguous prediction of the qualitative change in Z as explained below.

Figure 1 gives an example of a qualitative tree. This qualitative tree is a
qualitative model of the function $Z = X^2 - Y^2$ and describes how Z qualitatively
depends on attributes X and Y. The tree partitions the attribute space into four
regions that correspond to the four leaves of the tree. A different QCF applies
in each of the leaves. The QCF $Z = M^{+,-}(X, Y)$ applies in the rightmost leaf,
where both X and Y are positive.

Qualitatively constrained functions are inspired by the qualitative
proportionality predicates $Q_{+}$ and $Q_{-}$ as defined by Forbus [4] and are also a
generalization of the qualitative constraint $M^{+}$, as used in QSIM [9]. A QCF
constrains the qualitative change of the class variable in response to the qualitative
changes of the attributes. Namely, a QCF $M^{s_1, \ldots, s_m}$, $s_i \in \{+, -\}$, represents
an arbitrary function $\mathbb{R}^m \to \mathbb{R}$ with $m$ continuous attributes that respects the
qualitative constraints given by signs $s_i$. The qualitative constraint given by sign
$s_i = +$ ($s_i = -$) requires that the function is strictly increasing (decreasing) in
its dependence on the $i$-th attribute. We say that the function is positively
related (negatively related) to the $i$-th attribute. $M^{s_1, \ldots, s_m}$ represents any function
which is, for all $i = 1, \ldots, m$, positively (negatively) related to the $i$-th argument
if $s_i = +$ ($s_i = -$).

Note that the qualitative constraint given by sign $s_i = +$ only states that
when the $i$-th attribute increases, the QCF will also increase, barring other
changes. It can happen that a QCF with the constraint $s_i = +$ decreases even
if the $i$-th attribute increases, because of a change in another attribute. For
example, consider the behaviour of gas pressure in a container given by the equation
$Pres \times Vol / Temp = const$. We can express the qualitative behaviour of gas
pressure by the QCF $Pres = M^{+,-}(Temp, Vol)$. This constraint allows that the
pressure decreases even if the temperature increases, because of a change in the
volume. Notice, however, that the qualitative behaviour of gas is not consistent
with the constraint $Pres = M^{+}(Temp)$.

QCFs are concerned with qualitative changes. Qualitative change $q_i$ in the
i-th attribute is the sign of change in that variable. This can be either
positive, negative or zero change. For simplicity, we ignore zero changes in
the next paragraphs. QCF-prediction $P(s_i, q_i)$ is the qualitative change of
the class variable predicted according to a single (i-th) attribute.
QCF-prediction is positive if $s_i$ and $q_i$ are both positive or both
negative, and is negative otherwise. Qualitative ambiguity, i.e. ambiguity in
the class's qualitative change, appears whenever there exist both positive and
negative QCF-predictions according to different attributes. A qualitatively
constrained function is consistent with a pair of examples if there exists an
attribute whose QCF-prediction is equal to the class's qualitative change. We
say that this example pair is QCF-consistent. A qualitatively constrained
function is consistent with a set of examples if it is consistent with all
possible example pairs, i.e. when all possible example pairs are
QCF-consistent.
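The consistency test described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function names, the
encoding of signs as +1/-1, and the gas data (generated with Pres = Temp / Vol,
i.e. const = 1) are hypothetical, and a pair in which no attribute changes is
treated as unconstrained.

```python
from itertools import combinations

def sign(x):
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)

def pair_is_consistent(signs, attrs_a, class_a, attrs_b, class_b):
    """Check QCF-consistency of one example pair.

    signs holds one entry per QCF-attribute: +1 for '+', -1 for '-'.
    """
    class_change = sign(class_b - class_a)
    # QCF-prediction per attribute is s_i * q_i; zero changes are ignored.
    predictions = {s * sign(b - a)
                   for s, a, b in zip(signs, attrs_a, attrs_b)
                   if b != a}
    if not predictions:
        return True  # assumption: no attribute changed, nothing is constrained
    return class_change in predictions

def set_is_consistent(signs, examples):
    """examples is a list of (attribute_tuple, class_value) pairs."""
    return all(pair_is_consistent(signs, a, ca, b, cb)
               for (a, ca), (b, cb) in combinations(examples, 2))

# Hypothetical gas data: attributes (Temp, Vol), class Pres = Temp / Vol.
gas = [((300.0, 1.0), 300.0), ((330.0, 2.0), 165.0)]
temp_only = [((t,), p) for (t, _v), p in gas]

print(set_is_consistent([+1, -1], gas))    # M^{+,-}(Temp, Vol)
print(set_is_consistent([+1], temp_only))  # M^{+}(Temp)
```

With these two examples the temperature rises while the pressure falls, so the
pair is accepted by the two-attribute constraint (the volume change explains the
drop) and rejected by the constraint on temperature alone.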

3 Qfilter 

3.1 Idea and Example 

Here we describe the algorithm Qfilter that, given a set of examples and a
qualitative tree, adjusts the class values in such a way that they are
consistent with the qualitative tree. Namely, Qfilter is an optimization
procedure that finds the minimal required quadratic changes in class values to
achieve qualitative consistency with the qualitative tree. In one respect
Qfilter can be viewed as a filter that smooths the data and removes the
qualitative errors introduced by the measurement errors, but here we use it to
remove the qualitative errors made by numerical predictors, as will be
explained later.

Fig. 2. Achieving consistency with QCF $C = M^{+}(A)$: since the QCF requires
that class C is strictly increasing in attribute A, class values $c_i$ (denoted
by circles) are changed into $c_i + d_i$ (denoted by crosses) by minimizing the
sum of squared changes $d_i$. The arrows denote the class changes $d_i$.

Let us first observe a simple example illustrated in Figure 2. We have eight
examples $(a_i, c_i)$, $i = 0, 1, \ldots, 7$, described with the values of
class C and attribute A. The examples are not consistent with the given QCF
$C = M^{+}(A)$, because the QCF requires that $c_{i+1} > c_i$, which is
violated at i = 1 and i = 4.

To achieve consistency with $C = M^{+}(A)$, class values should be changed into
$c_i + d_i$, where the unknown parameter $d_i$ denotes the change in the i-th
class value. Class changes $d_i$ are constrained by the QCF-imposed
inequalities $c_{i+1} + d_{i+1} > c_i + d_i$, where $i = 0, 1, \ldots, 6$. This
gives an optimization problem that can be formulated in matrix notation by
writing the inequalities as $A d > b$, where d is a vector of unknown
parameters $d_i$, vector b has elements $b_i = c_i - c_{i+1}$, and matrix A has
elements $a_{i,i} = -1$, $a_{i,i+1} = 1$ and zeros elsewhere. Therefore
finding minimal quadratic changes in class values that achieve consistency with
a given QCF can be posed as the optimization problem:

$$\text{find vector } d \text{ that minimizes } d^{T} H d
\quad \text{such that } A d > b \qquad (1)$$

In the above formulation matrix H is the identity matrix. In general H can be
changed to differently penalize the changes in class values. For example, we
could change H to require that the classes of examples that have higher
confidence are changed less. The above stated optimization problem is a kind of
quadratic programme and can be efficiently solved by a number of methods. We
used a quadratic programming solver in Matlab [10,11,12]. Since the criterion
function $d^{T} H d$ with diagonal matrix H is a convex function, and because
the linear constraints $A d > b$ define a convex feasible region, any local
minimum of the criterion function is a globally optimal solution.

Note that in the above formulation the values of attribute A are not mentioned
at all. However, the ordering of attribute values, i.e. $a_{i+1} > a_i$, was
used together with QCF $C = M^{+}(A)$ to set the ordering of class values, i.e.
$c_{i+1} > c_i$. A different ordering of attribute values would require a
different ordering of class values, depending also on the QCF. When more than
one attribute is used in a QCF, finding an appropriate ordering of class values
is not as trivial as in the example above and is explained in the next section.
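For the single-attribute case the optimization can be illustrated without a
general quadratic programming solver. The sketch below is not the Matlab solver
used in the paper: it assumes the non-strict form of the constraints
($c_{i+1} + d_{i+1} \ge c_i + d_i$) and an identity or diagonal H, in which case
the problem is a (weighted) isotonic regression and can be solved by the
pool-adjacent-violators algorithm. The function names and the eight class
values are hypothetical; the values were chosen so that the ordering is
violated at i = 1 and i = 4.

```python
def isotonic_fit(values, weights=None):
    """Weighted least-squares non-decreasing fit (pool adjacent violators)."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge neighbouring blocks while their means violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for mean, _weight, count in blocks:
        fitted.extend([mean] * count)
    return fitted

def qfilter_single_attribute(examples):
    """examples: list of (a_i, c_i); returns adjusted classes in input order."""
    # The attribute values are used only to order the class values.
    order = sorted(range(len(examples)), key=lambda i: examples[i][0])
    fitted = isotonic_fit([examples[i][1] for i in order])
    adjusted = [0.0] * len(examples)
    for rank, i in enumerate(order):
        adjusted[i] = fitted[rank]
    return adjusted

# Hypothetical data: c_{i+1} > c_i is violated at i = 1 and i = 4.
data = list(zip(range(8), [1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 7.0, 8.0]))
adjusted = qfilter_single_attribute(data)
changes = [new - old for new, (_a, old) in zip(adjusted, data)]
print(adjusted)                     # [1.0, 2.5, 2.5, 4.0, 5.5, 5.5, 7.0, 8.0]
print(sum(d * d for d in changes))  # 1.0
```

Because the constraints are relaxed to non-strict inequalities, violating
neighbours end up tied rather than strictly ordered; enforcing a strict
increase would need an additional small margin between consecutive values.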

3.2 Details of the Qfilter Algorithm 

Qfilter handles each leaf of a qualitative tree separately. It first splits the
examples according to the qualitative tree and then changes class values to
achieve consistency with the QCF in the corresponding leaf. As mentioned above,
Qfilter uses the attributes' values and the QCF to find an appropriate ordering
of class values, poses the optimization problem in the form given by
Equation 1, and solves it by using quadratic programming methods. To explain
how to find an appropriate ordering of class values we first define two useful
terms, and then explain some properties of QCFs. Since in general not all of
the attributes appear in a QCF, we call an attribute that appears in a QCF a
QCF-attribute. We call a QCF that does not have a negative dependence on any
attribute, i.e. is positively related to all QCF-attributes, a pure-positive
QCF.

The first interesting QCF property is that an arbitrary QCF can be, by
appropriate changes in attributes, replaced by a pure-positive QCF. It is easy
to check that a QCF that is negatively related to attribute $A_i$ and
positively related to all other attributes is equivalent to a pure-positive
QCF where the attribute $A_i$ is replaced by $A_i' = -A_i$. For example, QCF
$M^{+,-}(A_1, A_2)$ is equivalent to QCF $M^{+,+}(A_1, A_2')$, where
$A_2' = -A_2$. Therefore we can simply multiply by minus one (or any other
negative number) all the negatively related QCF-attributes to get a
pure-positive QCF. In fact, multiplying all negatively related QCF-attributes
by minus one is the first step in Qfilter. In the rest of this section we
assume that a given QCF is a pure-positive QCF.
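This first step amounts to a sign flip on the negatively related attributes. A
minimal sketch, assuming examples are stored as tuples of QCF-attribute values
and the QCF signs as a string of '+' and '-' characters (both representations
and the function name are hypothetical):

```python
def to_pure_positive(examples, signs):
    """Negate every attribute whose QCF sign is '-'.

    examples: list of tuples holding the QCF-attribute values.
    signs: one '+' or '-' per QCF-attribute, e.g. "+-" for M^{+,-}(A1, A2).
    """
    return [tuple(x if s == '+' else -x for x, s in zip(example, signs))
            for example in examples]

# Hypothetical examples for the QCF M^{+,-}(A1, A2).
examples = [(1.0, 5.0), (2.0, 3.0)]
print(to_pure_positive(examples, "+-"))  # [(1.0, -5.0), (2.0, -3.0)]
```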

The second interesting QCF property is that a QCF defines a partial ordering
of class values and that a pure-positive QCF is consistent with a set of
examples if the class value of every example e is greater than the class value
of every example from a set called $F_{sg}(e)$. More formally, we show that a
pure-positive QCF is consistent with a set of examples if, and only if:

$$\forall e, f \in \mathit{Examples}: \; f \in F_{sg}(e) \Rightarrow c_e > c_f \qquad (2)$$

where $F_{sg}(e)$ is the set of examples $f_{sg}$ that are smaller than e in
every QCF-attribute and for which there is no example $f_s$ that is smaller
than e in every QCF-attribute and greater than $f_{sg}$ in every
QCF-attribute. This is explained in the next paragraphs. We show this by using
the QCF consistency criterion that requires that all possible example pairs
are QCF-consistent.

A first simplification is that we do not need to check all possible example
pairs. A pure-positive QCF is consistent with a set of examples if every
example e is QCF-consistent with all the examples $f_s$ that are smaller than
e in every QCF-attribute. The set of such examples is denoted by $F_s(e)$.
Namely, example pairs of e and examples $f_{amb}$ that are greater than e in
one QCF-attribute and smaller than e in another QCF-attribute are
QCF-consistent with any pure-positive QCF. QCF-consistencies of example pairs
of e and examples $f_g$ that are greater than e in all QCF-attributes are
checked when examples $f_g$ are checked. Since the relation “is smaller than e
in every QCF-attribute” is transitive, we need to check the consistency only
for the “largest” examples from $F_{sg}(e) \subseteq F_s(e)$, i.e. examples
$f_{sg} \in F_s(e)$ with the property that in $F_s(e)$ there is no example that
is greater than example $f_{sg}$ in every QCF-attribute. Since all the examples
from $F_{sg}(e)$ are smaller than e in every QCF-attribute, a pure-positive QCF
requires that the class value of example e is greater than the class value of
every example from the set $F_{sg}(e)$, i.e.
$\forall f \in F_{sg}(e): c_e > c_f$.

The ordering of class values given by Equation 2 is used to set the inequality
constraints, i.e. matrix A and vector b from Equation 1. For each
$f \in F_{sg}(e)$ we add one (say i-th) constraint $c_e + d_e > c_f + d_f$;
therefore vector b has elements $b_i = c_f - c_e$, and matrix A has elements
$a_{i,f} = -1$, $a_{i,e} = 1$ and zeros elsewhere.
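The construction of $F_{sg}(e)$ and of the constraint rows can be sketched as
follows. This is a straightforward, unoptimized illustration for a
pure-positive QCF; the function names and the four examples are hypothetical,
and the resulting rows and right-hand sides would still have to be passed to a
quadratic programming solver.

```python
def is_smaller(f, e):
    """True if example f is smaller than e in every QCF-attribute."""
    return all(fa < ea for fa, ea in zip(f, e))

def largest_smaller(attrs, e_idx):
    """Indices of F_sg(e): the 'largest' examples among those smaller than e."""
    smaller = [j for j, f in enumerate(attrs) if is_smaller(f, attrs[e_idx])]
    # Drop every f for which another member of F_s(e) is greater than f.
    return [j for j in smaller
            if not any(is_smaller(attrs[j], attrs[k]) for k in smaller)]

def build_constraints(attrs, classes):
    """Return (rows of A, vector b) for the constraints c_e + d_e > c_f + d_f."""
    n = len(attrs)
    rows, b = [], []
    for e in range(n):
        for f in largest_smaller(attrs, e):
            row = [0] * n
            row[e] = 1    # a_{i,e} = 1
            row[f] = -1   # a_{i,f} = -1
            rows.append(row)
            b.append(classes[f] - classes[e])  # b_i = c_f - c_e
    return rows, b

# Hypothetical examples with two QCF-attributes.
attrs = [(1, 1), (2, 2), (3, 3), (1, 3)]
classes = [1.0, 0.5, 2.0, 1.5]
rows, b = build_constraints(attrs, classes)
print(rows)  # [[-1, 1, 0, 0], [0, -1, 1, 0]]
print(b)     # [0.5, -1.5]
```

In this example (3, 3) is compared only with (2, 2), not with (1, 1), because
the constraint between (2, 2) and (1, 1) already implies the ordering by
transitivity. The example (1, 3) is not smaller or greater than any other
example in every attribute, so it contributes no constraint.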

3.3 Qfilter for Numerical Prediction 

The basic idea of using Qfilter for numerical prediction is to apply it, with a
given qualitative tree, to the predictions of an arbitrary numerical learner. A
numerical predictor is usually trained on a set of learning examples, where
“correct” class values are given. For this reason it is quite natural to
provide the learning examples also to Qfilter. In this case Qfilter is supplied
with the learning examples with “correct” class values together with test
examples with predictions of class values. Qfilter then adjusts the class
values of both learning and test examples to fit a qualitative tree. Also using
the learning examples usually helps Qfilter. This is especially evident when
adjusting a prediction of a test example that is close to some learning
examples.

One possible improvement of Qfilter is to also use the confidence estimate in 
numerical prediction if it is provided by the numerical predictor. In this case, 
Qfilter would change the class values with higher confidence less, for the price 
of bigger changes of class values that have lower confidence. This is achieved by 
simply changing matrix H in Equation 1 from identity to a diagonal matrix with 
$h_{ii} = w_i$, where weight $w_i$ is computed from the predictor's confidence
estimate in the i-th class value. Of course, the computation of weight $w_i$
depends on the type and scale of the confidence estimate, but since a larger
$h_{ii}$ makes a change in the i-th class value more costly, the weight should
generally be larger when the numerical predictor is more confident in the i-th
class prediction.
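The effect of a diagonal H can be seen in the smallest possible case. As an
illustration (not taken from the paper), assume two examples with $a_0 < a_1$
but $c_0 > c_1$, weights $h_{00} = w_0$ and $h_{11} = w_1$, and the non-strict
form of the constraint. Minimizing along the active constraint gives:

```latex
\begin{aligned}
&\min_{d_0, d_1}\; w_0 d_0^2 + w_1 d_1^2
\quad \text{such that} \quad c_1 + d_1 \ge c_0 + d_0, \\
&\text{with } \Delta = c_0 - c_1 > 0: \qquad
d_0 = -\frac{w_1}{w_0 + w_1}\,\Delta, \qquad
d_1 = \frac{w_0}{w_0 + w_1}\,\Delta .
\end{aligned}
```

Both adjusted values meet at the weighted mean
$(w_0 c_0 + w_1 c_1) / (w_0 + w_1)$, so the class value with the larger diagonal
entry of H moves less.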
Fig. 3. Qualitative trees used with Qfilter in domains RoboY1, RoboY2atrY1 and
in the population dynamics domain ZooChange. Note that 1.57 in the first two
trees is an approximation of $\pi/2$, where $\sin(\Phi)$ changes from an
increasing to a decreasing function.



4 Experimental Results 

Here we compare the numerical accuracy of locally weighted regression [3] (LWR)
and Qfilter. Qfilter was used to adjust the LWR predictions according to given
qualitative trees. We used a standard procedure to optimize LWR. Namely, at
each point of prediction, LWR chose the width of the Gaussian kernel used to
weight the neighboring examples so as to minimize the mean squared local
cross-validation error.
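For readers unfamiliar with LWR, the following is a minimal one-attribute
sketch of locally weighted linear regression with a Gaussian kernel. It is only
an illustration: the function name and data are hypothetical, and the kernel
width is passed in as a fixed value rather than optimized by local
cross-validation as in the experiments.

```python
import math

def lwr_predict(xs, ys, x0, width):
    """Predict y at x0 with a Gaussian-weighted local linear fit (1-D)."""
    # Examples close to the query point x0 receive larger weights.
    ws = [math.exp(-((x - x0) ** 2) / (2.0 * width ** 2)) for x in xs]
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Hypothetical noise-free data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(lwr_predict(xs, ys, 1.5, 1.0))
```

On exactly linear data the local fit reproduces the line, so the prediction at
1.5 is 4 up to rounding. On noisy data nothing in such a local fit enforces a
qualitative constraint.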

We experimented with different learning set sizes and different noise in the
class variable. We used normally distributed zero-mean noise. Noise percentage
p % means that the standard deviation of noise is $d_c \cdot p / 100$, where
$d_c$ denotes the difference between the maximal and minimal class value. First
we describe experiments in four artificial domains and then experiments in a
more complex population dynamics domain.
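The noise definition above translates directly into code. A small sketch,
assuming the class values are held in a Python list (the function name and the
fixed seed are hypothetical choices for reproducibility):

```python
import random

def add_class_noise(classes, p, rng=None):
    """Add zero-mean Gaussian noise with standard deviation d_c * p / 100."""
    rng = rng or random.Random(0)
    d_c = max(classes) - min(classes)  # range of the class variable
    sigma = d_c * p / 100.0
    return [c + rng.gauss(0.0, sigma) for c in classes]

classes = [0.0, 2.5, 5.0, 10.0]
noisy = add_class_noise(classes, 5)  # 5 % noise: sigma = 10 * 5 / 100 = 0.5
print(noisy)
```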

4.1 Artificial Domains 

Here we describe experiments in four artificial domains. The first is the
domain called Quad with attributes X and Y and class $Z = X^2 - Y^2$.
Attributes X and Y are uniformly distributed between -10 and 10. Qfilter used
LWR predictions and the qualitative tree given in Figure 1 and explained in
Section 2.

The second set of domains consists of three domains, called RoboY1, RoboY2
and RoboY2atrY1. Here we model a planar two-link, two-joint robot arm. The
angle in the shoulder joint is denoted by $\Phi_1$ and the angle in the elbow
joint is denoted by $\Phi_2$. Angle $\Phi_1$ is between zero and $\pi$, while
$\Phi_2$ is between $-\pi/2$ and $\pi/2$. When the arm is in horizontal
position, $\Phi_1$ and $\Phi_2$ are both zero. The first link, i.e. the link
from shoulder to elbow, is extendible with length $L_1$ ranging from 2 to 10.
The second link has fixed length $L_2 = 5$. The first learning problem is to
predict the y-coordinate of the first link end, i.e.
$Y1 = L_1 \sin(\Phi_1)$. This problem is called RoboY1. For Qfilter we used the
qualitative tree given in Figure 3.

Fig. 4. Noise curve in domain RoboY1 with 100 learning examples: on the x-axis
is noise percentage and on the y-axis is LWR (line with circles) and Qfilter
(dotted line with triangles) mean squared error.

The second learning problem is to predict the y-coordinate of the second link
end, i.e. $Y2 = L_1 \sin(\Phi_1) + 5 \sin(\Phi_1 + \Phi_2)$. Here we helped the
learners with a derived attribute $\Phi_{sum} = \Phi_1 + \Phi_2$, i.e. the
deflection of the second link from the horizontal. We experimented with two
versions of this learning problem. In domain RoboY2 we used the attributes
$L_1$, $\Phi_1$, $\Phi_2$ and $\Phi_{sum}$. In domain RoboY2atrY1 we also used
the correct Y1 as an attribute. We generated examples where angles $\Phi_1$ and
$\Phi_2$ and link length $L_1$ are uniformly distributed in their possible
ranges. The qualitative tree used with domain RoboY2atrY1 is given in Figure 3.
The qualitative tree used with domain RoboY2 has four leaves, with the same
root node as the qualitative tree for domain RoboY2atrY1, but with Y1 replaced
by the qualitative tree for Y1.
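Noise-free examples for the robot-arm domains can be generated directly from
the formulas above. The sketch below is an illustration only: the dictionary
keys, the function name and the seed are hypothetical, while the ranges and
formulas are those given in the text.

```python
import math
import random

L2 = 5.0  # fixed length of the second link

def make_example(rng):
    """Draw one robot-arm example with uniformly distributed L1, Phi1, Phi2."""
    l1 = rng.uniform(2.0, 10.0)
    phi1 = rng.uniform(0.0, math.pi)
    phi2 = rng.uniform(-math.pi / 2, math.pi / 2)
    phi_sum = phi1 + phi2                 # derived attribute
    y1 = l1 * math.sin(phi1)              # class in RoboY1
    y2 = y1 + L2 * math.sin(phi_sum)      # class in RoboY2 and RoboY2atrY1
    return {"L1": l1, "Phi1": phi1, "Phi2": phi2,
            "PhiSum": phi_sum, "Y1": y1, "Y2": y2}

rng = random.Random(0)
examples = [make_example(rng) for _ in range(100)]
print(examples[0])
```

For RoboY2 the attributes would be L1, Phi1, Phi2 and PhiSum; RoboY2atrY1
additionally uses Y1 as an attribute.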

We experimented with different learning set sizes and different noise in the
class variable. We used a test set of 200 examples without noise. Table 1 gives
the comparison of LWR and Qfilter mean squared errors (MSE) with 100 learning
examples and various noise levels. All the results are averages over 10 sets of
randomly selected learning and test examples. With all four learning problems
the improvement of Qfilter with respect to LWR is obvious. Qfilter usually
reduces LWR MSE by more than 20 %. The MSE reduction usually increases with
increased noise. Figure 4 shows a typical noise curve in domain RoboY1.
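The relative reductions can be recomputed from the MSE pairs reported in
Table 1. A short sketch (the function name is hypothetical; the numbers are the
artificial-domain entries for 100 learning examples with no noise, 5 % noise
and 20 % noise):

```python
def mse_reduction(lwr_mse, qfilter_mse):
    """Relative MSE reduction of Qfilter with respect to LWR, in percent."""
    return 100.0 * (lwr_mse - qfilter_mse) / lwr_mse

# (LWR, Qfilter) MSE pairs copied from Table 1: no noise, 5 %, 20 % noise.
table1 = {
    "Quad":        [(98.4, 84.7), (149.3, 114.6), (765.6, 554.7)],
    "RoboY1":      [(0.298, 0.196), (0.407, 0.280), (1.924, 1.367)],
    "RoboY2":      [(2.618, 2.305), (3.078, 2.612), (6.823, 5.167)],
    "RoboY2atrY1": [(0.940, 0.691), (1.324, 0.968), (3.665, 2.707)],
}

for domain, pairs in table1.items():
    print(domain, [round(mse_reduction(lwr, qf), 1) for lwr, qf in pairs])
```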

We also experimented with different learning set sizes. As an illustration,
Table 1 gives the results of learning from examples with no noise in domain
RoboY1. When we used only 10 or 20 learning examples the Qfilter reduction of
error is relatively small, since none of the learners is able to generalize
well from such a small learning set. But as the learning set increases, Qfilter
can take advantage of the given qualitative knowledge. After a certain learning
set size, the reduction of error decreases with increasing learning set size.
However, the reduction in error is usually still visible even when we use a
relatively large learning set. Of course this depends on the difficulty of the
domain. When a numerical learner gives predictions that are consistent with a
given qualitative tree, Qfilter does not change them.



4.2 Population Dynamics Domain 

The last domain models the dynamic behavior of an aquatic ecosystem that
involves populations of zooplankton and phytoplankton, and the inorganic
nutrient nitrogen, which are denoted by variables Zoo, Phyto and Nut,
respectively. The model assumes a closed ecosystem with no inflow and consists
of two consumption interactions. Namely, phytoplankton consumes nitrogen, and
zooplankton consumes phytoplankton. This results in complex time behavior of
the variables.

Table 1. Comparison of LWR and Qfilter accuracy. The first table gives MSE with
100 learning examples and various noise levels in all the described domains.
Since the changes in zooplankton are small, the values of MSE given for the
domain ZooChange are multiplied by 10®. The second table gives MSE in domain
RoboY1 when different numbers of learning examples with no noise were used. All
the results are averages over 10 sets of learning examples.

Domain name   Class variable                         No noise MSE     5 % noise MSE    20 % noise MSE
                                                     LWR ; Qfilter    LWR ; Qfilter    LWR ; Qfilter
Quad          Z = X^2 - Y^2                          98.4 ; 84.7      149.3 ; 114.6    765.6 ; 554.7
RoboY1        Y1 = L_1 sin(Phi_1)                    0.298 ; 0.196    0.407 ; 0.280    1.924 ; 1.367
RoboY2        Y2 = L_1 sin(Phi_1) + 5 sin(Phi_sum)   2.618 ; 2.305    3.078 ; 2.612    6.823 ; 5.167
RoboY2atrY1   Y2 as above, using attr. Y1            0.940 ; 0.691    1.324 ; 0.968    3.665 ; 2.707
ZooChange     ZooCh(t) = Zoo(t+1) - Zoo(t)           0.015 ; 0.008    0.112 ; 0.102    2.269 ; 1.889

Domain name   20 learn. ex. MSE   50 learn. ex. MSE   100 learn. ex. MSE   300 learn. ex. MSE
              LWR ; Qfilter       LWR ; Qfilter       LWR ; Qfilter        LWR ; Qfilter
RoboY1        3.690 ; 3.421       1.201 ; 0.933       0.298 ; 0.196        0.019 ; 0.018

Our learning task is to predict the change in zooplankton ZooCh(t), i.e. the
difference between the zooplankton population at the next and the current time
point (ZooCh(t) = Zoo(t+1) - Zoo(t)), given the values of zooplankton,
phytoplankton, and nutrient at the current time point. We used experimental
data that was kindly provided by Ljupco Todorovski and Saso Dzeroski, who
previously experimented in this domain and give a more elaborate description of
the domain [13]. The experimental data was generated by the following
differential equations model:



$$\begin{aligned}
\dot{Nut} &= 2 - Phyto \cdot Nut \\
\dot{Phyto} &= 0.1\,[\ldots]\; 0.7\, Phyto \cdot Nut - 0.5\, \frac{[\ldots]}{Phyto + 0.5} \\
\dot{Zoo} &= -0.1\, Zoo + 0.25\, \frac{Zoo \cdot Phyto}{Phyto + 0.5}
\end{aligned} \qquad (3)$$



In contrast to other experimental domains we do not use a qualitative model
that would completely correspond to the actual numerical behavior of the
population dynamics model. Instead we use a heuristic qualitative tree given in
Figure 3. This qualitative tree was obtained by qualitative abstraction of the
Zoo equation in Equation 3 and assumes constant values of the variables between
the current and the next time point. It is just an approximate qualitative
model an expert might give and has the following interpretation. Since
zooplankton feeds on phytoplankton, a larger phytoplankton population enables a
bigger positive change in zooplankton. The change of zooplankton is also
positively related to the zooplankton population, since the growth rate of a
population is positively related to the size of the population. But if the
phytoplankton population is too small (below 0.33 in the qualitative tree in
Figure 3) to provide enough food for zooplankton, then the change in
zooplankton will be negatively related to the zooplankton population.
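The threshold 0.33 can be checked against the zooplankton equation
$\dot{Zoo} = -0.1\,Zoo + 0.25\,Zoo \cdot Phyto / (Phyto + 0.5)$ of Equation 3.
The following short derivation is an illustration that, like the heuristic
tree, treats the change in zooplankton as proportional to $\dot{Zoo}$ with the
variables held constant between two time points:

```latex
\begin{aligned}
\frac{\partial \dot{Zoo}}{\partial Zoo}
  &= -0.1 + 0.25\,\frac{Phyto}{Phyto + 0.5} > 0
  \;\Longleftrightarrow\; 0.15\,Phyto > 0.05
  \;\Longleftrightarrow\; Phyto > \tfrac{1}{3} \approx 0.33, \\
\frac{\partial \dot{Zoo}}{\partial Phyto}
  &= 0.25\,Zoo\,\frac{0.5}{(Phyto + 0.5)^2} > 0 \quad \text{for } Zoo > 0 .
\end{aligned}
```

So under this approximation the change in zooplankton is positively related to
Phyto for any positive zooplankton population, and its dependence on Zoo
switches from negative to positive at Phyto = 1/3.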

The data consists of ten traces generated by simulating a numerical model
from ten randomly chosen triples of starting values for variables Zoo, Phyto
and Nut. Each simulation lasts for 100 time steps and gives 100 examples, each
example being described with attributes Nut(t), Phyto(t) and Zoo(t). The class
variable ZooCh was computed as the difference in zooplankton population between
two consecutive points in time, i.e. ZooCh(t) = Zoo(t+1) - Zoo(t). The learning
examples were randomly selected from the first five traces, and the test
examples were randomly selected from the second five traces. We used 100
learning and 100 test examples. The results in Table 1 are averages of learning
from ten random selections of examples. These results show that even an
approximate qualitative model can help Qfilter to improve numerical accuracy.

In the experiments with the Weka [4] implementation of M5 regression and model
trees [2], qualitative errors were even more obvious, as illustrated also
in [1]. For this reason the accuracy improvements of Qfilter with respect to
model and regression trees were usually bigger.

5 Conclusions 

We presented a novel approach to numerical machine learning called Qfilter.
Qfilter is a numerical regression method that can take into account qualitative
background knowledge expressed as a qualitative tree with qualitatively
constrained functions in the leaves of the tree. As qualitative domain
knowledge is often available in practice, Qfilter's ability to exploit such
knowledge should be beneficial in many applications. One desirable consequence
of using such qualitative knowledge is improved accuracy of numerical
predictions. Another desirable property is that the resulting numerical
regression model is qualitatively consistent with known qualitative relations
in the domain of application.

There are several directions in which Qfilter can be extended. As noted in
Section 3.3, a possible improvement is to use the confidence estimate in
numerical prediction provided by the numerical predictor, so that class values
with a higher confidence estimate are changed less. Experiments using the size
of confidence intervals provided by LWR show that this can additionally improve
Qfilter accuracy. Qfilter, as presented in this paper, requires a numerical
learner and does not provide an explicit model. However, the quadratic
programming approach can easily be extended to induce a piecewise linear model
that is consistent with a given qualitative model. Another interesting point is
that Qfilter finds the minimal sum of squared changes of class values needed to
achieve consistency with a given qualitative model. In this respect it gives
the error of numerical data w.r.t. a qualitative model or vice versa and
provides a bridge between qualitative and numerical models.

In the experiments in several domains, Qfilter always improved the accuracy
of numerical predictions compared to standard regression methods. Improvements
in accuracy were observed even in cases when the qualitative constraints
applied were only approximate. In the experiments, the improvements were
observed consistently when varying the number of learning examples and the
degree of noise in the data. In this paper we assumed that qualitative trees
are given. An appealing alternative would be to use induced qualitative trees.
Depending on the noise, QUIN often induced qualitative trees similar to those
used here.

Abstract. Determining the semantic role of sentence constituents is a key task
in determining sentence meanings lying behind a veneer of variant syntactic
expression. We present a model of natural language generation from semantics
using the FrameNet semantic role and frame ontology. We train the model using
the FrameNet corpus and apply it to the task of automatic semantic role and
frame identification, producing results competitive with previous work (about
70% role labeling accuracy). Unlike previous models used for this task, our
model does not assume that the frame of a sentence is known, and is able to
identify null-instantiated roles, which commonly occur in our corpus and whose
identification is crucial to natural language interpretation.



1 Introduction 

A central goal of natural language processing is domain-independent
understanding. A useful step towards that goal is the assignment of semantic
roles to the (syntactic) constituents of a sentence. Having semantic roles
allows one to recognize semantic arguments of a situation, even when expressed
in different syntactic configurations. For example, the role of an instrument,
such as a hammer, can be recognized regardless of whether its expression is as
the subject of the sentence (the hammer broke the vase) or via a prepositional
phrase headed by with. This paper attempts the task of learning to
automatically assign such roles. Identifying such roles and the relationships
between them can in turn serve as support for inference about a sentence's
meaning, for antecedent resolution, or for other understanding or parsing tasks
such as prepositional phrase attachment or word sense disambiguation.

This paper develops a generative model from which one can infer role labels, given 
sentence constituents and a word from that sentence that is a predicator, which takes se- 
mantic role arguments. We learn the parameters for this model from a body of examples 
provided by the FrameNet corpus [1]. The problem and some elements of our approach 
are similar to that of [2], but the work differs by use of a generative, not a discrimina- 
tive, model, and by assuming less known information for making the role assignment. 
A difficulty of this task is that there is limited data available annotated
with semantic roles, in comparison to syntactic parsing. As an illustration of
this, in the model developed by [2] the most accurate rules only covered 50% of
the unseen examples. To overcome the limited amount of training data, we would
ultimately like to apply bootstrapping, in which limited labeled data are
combined with unlabeled data to produce a more accurate model than that trained
on labeled data alone [3, 4]. Generative models are a natural choice in the
case of combining fully and partially annotated data. First, we need to test
their capabilities on fully annotated data, such as it exists. This is the
focus of the current paper.

Our work can be compared and contrasted with much past work in information ex- 
traction [5-7], in which the goal is to extract from text words or phrases that fill a role, 
such as “acquiring company” or “vehicle,” and in which there are often multiple roles of 
interest. In particular, recent work such as [5] uses Hidden Markov Models, including 
induction over the structure of the model, for the labeling task. The model we use is sim- 
ilar, but while our goal is also to identify which roles are filled, and identify the words 
that fill them, we additionally aim to identify the overarching relationship that holds 
between the roles. We call this relationship the frame. Secondly, information extraction 
normally uses a small number of very domain specific roles, while our corpus has a 
large number of roles, with many types of roles that apply across domains. The tech- 
niques of information extraction may not scale well to large numbers of roles. Also, in 
information extraction, the labeling task is somewhat tied, semantically, to the domain 
at hand. These methods also tend to rely on regular structure, such as capitalization 
or indicator terms drawn from a closed class. Finally, the currently annotated semantic 
data is primarily at the sentence level, versus entire texts for information extraction. 

The acquisition of selectional preferences, or the tendency of verbs to prefer argu- 
ments of a particular type, is a second closely related area [8, 9]. In this line of research 
statistical models are typically trained on parsed sentences to determine verb-subject or 
verb-direct object relationships. Such information can be useful for prepositional phrase 
attachment or to help determine the semantic class of a previously unseen word. 

In this paper, we show that our generative model for role labeling produces results 
competitive with previous work in this area. In addition, our model is flexible enough 
to be used for annotating additional data, thus improving the model and the pool of 
data available for other researchers. Second, it has the advantage of capturing the case 
when roles are null instantiated in a particular sentence: they are not overtly expressed 
but their presence is understood implicitly in discourse. While our model handles these 
roles, we leave to future work a full evaluation of this ability. Finally, it can identify 
which constituents correspond to role labels of a particular given predicator. 

2 Background 

In this section we discuss the FrameNet Corpus, the previous work on labeling roles by 
Gildea and Jurafsky, and the role labeling task in more detail. 

2.1 The FrameNet Corpus 

FrameNet [1] is a large-scale, domain-independent computational lexicography
project organized around the motivating principles of lexical semantics: that
systematic correlations can be found between the meaning components of words,
principally the semantic roles associated with events, and their combinatorial
properties in syntax. This principle has been instantiated at various levels of
granularity in different traditions of linguistic research; FrameNet
researchers work at an intermediate level of granularity, termed the frame.
Examples of frames include MOTION_DIRECTIONAL, CONVERSATION, JUDGMENT, and
TRANSPORTATION. Frames consist of multiple lexical units, items corresponding
to a sense of a word. Examples for the MOTION_DIRECTIONAL frame are drop and
plummet. Also associated with each frame is a set of semantic roles. Examples
for the MOTION_DIRECTIONAL frame include the moving object, called the THEME;
the ultimate destination, the GOAL; the SOURCE; and the PATH.

In addition to frame and role definitions, FrameNet has produced a large number
of role-annotated sentences; the sentences are drawn primarily from the British
National Corpus. There are two releases of the corpus, FrameNet I and
FrameNet II^1; we present results from both, but have so far focused primarily
on the former. For each annotated example sentence, a lexical unit of interest,
one which takes arguments, is identified. We will call this word the
predicator^2. The words and phrases which participate in the predicator's
meaning are labeled with their roles, and the entire sentence is labeled with
the relevant frame. Finally, the corpus also includes syntactic category
information for each role. We give some examples below, with the frame listed
in braces at the beginning, the predicator in bold, and each relevant
constituent labeled with its role and phrase type. Note that the last example
has a DRIVER role that is null instantiated.

{Motion_directional} Mortars lob heavy shells high into the sky so that
[THEME they] drop [PATH down] [GOAL on the target] [SOURCE from the sky].

{Arriving} He heard the sound of liquid slurping in a metal container as
[THEME pairell] approached [SOURCE from behind].

{Transportation} [DRIVER] [CARGO The ore] was boated [GOAL down the river].

Our focus here is on the FrameNet corpus, but another semantically annotated cor- 
pus is under development, called the Proposition Bank [10]. This corpus, based on 
adding semantics to the Penn English Treebank, is projected to soon be larger than 
FrameNet, and involves comprehensive rather than selective annotation of a corpus. 
However, it does not incorporate the rich frame typology of FrameNet, and only a 
somewhat limited role typology; while roles are specified for each verb, there is no 
generalization across verbs. Finally, Proposition Bank labels only verbs, leaving nouns 
and adjectives for a later stage; FrameNet includes all three. Since we desire rich se- 
mantic information in preference to a large corpus, we use FrameNet annotations as our 
source of training data. Our methods, however, would generalize to Proposition Bank. 

^1 Also, confusingly, known as version 0.75 and version 1.0, respectively.

^2 What we call the predicator is called the target in the FrameNet theory, and
what we are calling a (semantic) role is called in FrameNet a frame element,
while what we call a constituent or argument head, [2] call simply the head. We
have found that most people find the FrameNet terminology rather confusing, and
so have adopted alternative terms here.

2.2 Gildea & Jurafsky’s Discriminative Model

Gildea and Jurafsky (2002) (henceforth, G&J) were the first to apply a
statistical learning technique to the FrameNet data. They describe a
discriminative model for determining the most probable role for a constituent
given the frame, the predicator and
some other features whose description we defer until later in the paper. They evalu- 
ate their model on a pre-release version of the FrameNet I corpus, which at that time 
contained about 50,000 sentences and 67 frame types. Their model was trained by first 
using the parser of Collins [11], and deriving features from that parse, the original sen- 
tence, and the correct FrameNet annotation of that sentence. Their work differs from 
ours in a number of important respects. Firstly, in all their experiments, they assume 
that the frame is already known, as well as the predicator of interest. While one could 
certainly imagine first determining the frame from the sentence (for example, one could 
use the model presented here to do that), their use of a discriminative approach makes 
it less straightforward to do joint inference over the choice of frame and semantic roles 
for constituents, as one would wish to do, whereas that is a natural thing to do within 
a generative model. Secondly, since their discriminative model assigns roles to con- 
stituents in the sentence, there is no natural way to handle unexpressed arguments, and 
they do not attempt to. But unexpressed arguments are common in natural languages, 
and again are naturally handled in a generative model. Moreover, most of their work 
assigns roles to constituents individually and independently. Later in their paper, they 
do develop and consider joint inference over all the semantic roles of a predicator, but 
this is more naturally done using the kind of model we present here. Finally, although 
this remains a promissory note, we believe that a generative model will be a better basis 
for extension via bootstrapping to unlabeled data. 

2.3 The Role Labeling Task 

With respect to the FrameNet corpus, several factors conspire to make the task of role- 
labeling challenging, with respect to the features available for making the classification. 
These results are likely to hold across other theories and methodologies for semantic 
role determination. The challenges also imply that constructing a hand-built semantic 
role identifier would prove a daunting task. First, it is not always predictable from the 
syntactic relationship between two phrases whether they stand in a semantic relation- 
ship. Second, many words that may participate in a role have a wide variety of possible 
roles in which they may participate. There are also many generic roles such as TIME 
and PLACE that can be indicated by almost any word. Third, the internal structure of a 
syntactic constituent is not always a good predictor of the role it receives. The prepo- 
sitional phrase in the hole, for example, can be a LOCATION, as in she sat in the hole, 
or a Goal of movement, as in she jumped in the hole. Finally, as mentioned earlier, in 
many cases roles are null instantiated, which is widespread in many languages; an En- 
glish example is passive sentences with no specified agent, such as the cake was eaten. 
Thus, the only evidence for the presence of such roles is contextual. 

With respect to the relationships between predicators, frames, and roles, further 
difficulties arise. A leading idea of FrameNet is that there is considerable variety to the 
semantic role types available in a particular event (for example, PERCEPTION events 
and COMMERCE events have very different participants). Thus, identifying the frame 
that is relevant for a particular sentence and predicator narrows the search for roles. 
However, many predicators are ambiguous with respect to their frame. Further, not all 
lexical units of a particular frame necessarily have the same distribution of roles. For 
example, drop and plummet have lexical entries in the MOTION_DIRECTIONAL frame,
but Source is rare for plummet, yet quite common for drop. As a result, for the task of
automatic role assignment a mixture of predicator-specific and frame-specific statistics
is potentially useful to deal with sparseness of a particular predicator or role.

3 A Generative Model for Sentence-Role Labeling 

Our goal is to identify frames and roles, given a natural language sentence and predicator.
As discussed above, G&J’s approach to this problem was to determine the most
probable role for each constituent of the sentence, given the frame, the predicator and
some other features. However, this does not capture null instantiation, or roles that are
not reified in the sentence. In addition, a model should ideally capture the relationships
between frames and roles, determining which constituents are likely roles for which 
predicator. To address these concerns we turn to a generative model to determine the 
sequence of role labels for a sentence. In other words, our model defines a joint prob- 
ability distribution over predicators, frames, roles, and constituents. While the model 
is fully general in its ability to determine these variables, in this paper it is only tested 
on its ability to determine roles and frames when given both a list of constituents and 
a single predicator. The generative model, illustrated in Figure 1, functions as follows. 
First, a predicator, S, is chosen, which then generates a frame, F. The frame generates
a (linearized) role sequence, R_1 through R_n, which in turn generates each constituent
of the sentence, C_1 through C_n. Note that, conditioned on a particular frame, the model
is just a Hidden Markov Model. The sentence-to-constituent mapping is discussed in 
more detail in Section 3.1. 

The model is complicated slightly by the fact that some sentence constituents do 
not correspond to a labeled semantic role. We handle these constituents with an idea 
from machine translation: that of the null source. A second complication is the null 
instantiations, which are also captured by a null, but in this case it is the emission 
which is a null. Henceforth, null sources will be described by an Unk (unknown) role 
to avoid confusion with null emissions. We will discuss an example with an unknown 
role in Section 3.1, and gave an example of a null emission in Section 2.1. 
The joint probability for a FrameNet example in this model is

$$P(C, R, F, S) = P(S) \times P(F \mid S) \times P(R \mid F, S) \times P(C \mid R, F, S),$$

where C is the vector of constituent heads, R is the role vector that generates them, F 
is the frame, and S is the predicator word. The third and fourth terms of this equation 
involve sequences. For the role sequence, we usually make a Markov assumption that 
each word’s role is dependent only on the previous role in the sequence. Thus: 

$$P(R \mid F, S) = \prod_i P(R_i \mid R_{i-1}, F, S),$$

where the R_i are the roles in the sequence. The Markov assumption has been effective
in language modeling and tagging and so seems a good assumption to begin with. 

Finally, our basic model assumes that constituent emissions are independent of the 
frame and predicator given the sequence of roles, that each emission depends only on 
the role that generated it, and that constituents are independent of each other. Thus: 

$$P(C \mid F, S, R) \approx P(C \mid R) = \prod_i P(C_i \mid R_i),$$

where C_i are the elements of C and R_i are the corresponding elements of R. This can be
compared to a part of speech tagging model where words are independent of each other 
given the tags, and depend only on the tag in the same position in the sequence. The 
independence of the constituents and the frame and predicator given the roles seems 
quite reasonable, given that most roles are frame-specific, and the whole rationale of 
FrameNet is that frames are sufficiently fine-grained that roles for predicators inside a 
single frame behave similarly. Adding further dependencies might be expected to only 
exacerbate the problem of sparseness in the data. 
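
The factorization above can be illustrated with a minimal sketch that scores one labeled example in log space. The table layout, the start symbol `<s>`, and the toy probabilities are assumptions made for illustration only. The transition table is keyed on the frame alone (dropping the predicator), which is one possible simplification and not necessarily the parameterization used in our experiments.

```python
import math

def joint_log_prob(predicator, frame, roles, constituents,
                   p_pred, p_frame, p_trans, p_emit, start="<s>"):
    """Return log P(C, R, F, S) = log P(S) + log P(F|S) + log P(R|F) + log P(C|R)."""
    logp = math.log(p_pred[predicator])               # P(S)
    logp += math.log(p_frame[(frame, predicator)])    # P(F | S)
    prev = start
    for role, const in zip(roles, constituents):
        logp += math.log(p_trans[(frame, prev, role)])  # P(R_i | R_{i-1}, F)
        logp += math.log(p_emit[(role, const)])         # P(C_i | R_i)
        prev = role
    return logp

# Toy parameters (hypothetical values, not estimated from FrameNet).
p_pred = {"rode": 0.5}
p_frame = {("TRANSPORTATION", "rode"): 1.0}
p_trans = {("TRANSPORTATION", "<s>", "DRIVER"): 0.5,
           ("TRANSPORTATION", "DRIVER", "VEHICLE"): 0.5}
p_emit = {("DRIVER", "Anne/NP"): 0.25, ("VEHICLE", "donkey/NP"): 0.5}

lp = joint_log_prob("rode", "TRANSPORTATION",
                    ["DRIVER", "VEHICLE"], ["Anne/NP", "donkey/NP"],
                    p_pred, p_frame, p_trans, p_emit)
```

Working in log space avoids underflow when the role sequence is long; exponentiating `lp` recovers the product of the six factors.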

3.1 Training the Model 

The FrameNet corpus contains annotations for all of the model components described 
above. To simplify the model, we chose to represent each constituent by its phrasal 
category together with the head word of that constituent. Since the FrameNet anno- 
tations do not include head word information, we determined the heads using simple 
heuristics. This representation and the method of head-finding are familiar from the 
statistical parsing literature ([12]). This data then provides a set of constituents with 
correctly annotated roles for a given sentence, where it is known which constituents 
correspond to roles and what the appropriate predicator is for those roles. For example, 
for the example below, the training example would be: S=rode; F=Transportation;
R_1=DRIVER; C_1=Anne/NP; R_2=VEHICLE; C_2=donkey/NP; R_3=AREA; C_3=on/PP.

{Transportation} “On 26th May [DRIVER Anne] rode [VEHICLE donkey]
[AREA beach],” the letter said.

Most of the parameters for the model are estimated using a straightforward max- 
imum likelihood estimate based on fully labeled training data. Emission probabilities 
need to be smoothed, due to the sparseness of head words. During training, all words 
seen only once are replaced by the phrase type label of the constituent of which they are
the head. This gives a phrasal-class based model, which is itself smoothed with a uniform
phrasal class prior; the probability of generating an unseen word belonging to a certain
class is estimated simply as a constant (representing P(word|class)) times the probability
of the phrasal class. Statistics are therefore gathered for the probabilities of
roles generating each combination of phrase type and head, together with a backoff model
of roles generating a phrase type and some unknown word within that type.
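
The smoothing scheme just described can be sketched as follows. This is a rough illustration under stated assumptions, not our exact implementation: the add-`alpha` form of the uniform phrasal-class prior, the value of the constant `unk_const`, and all function names are hypothetical, and the resulting scores are not renormalized.

```python
from collections import Counter

def train_emissions(examples):
    """examples: iterable of (role, head_word, phrase_type) triples."""
    examples = list(examples)
    head_freq = Counter(head for _, head, _ in examples)
    known_heads = {h for h, n in head_freq.items() if n > 1}  # seen more than once
    pair_counts = Counter()    # (role, head, phrase_type), non-singleton heads only
    class_counts = Counter()   # (role, phrase_type), all constituents
    role_totals = Counter()
    phrase_types = set()
    for role, head, ptype in examples:
        if head in known_heads:
            pair_counts[(role, head, ptype)] += 1
        class_counts[(role, ptype)] += 1
        role_totals[role] += 1
        phrase_types.add(ptype)
    return known_heads, pair_counts, class_counts, role_totals, phrase_types

def emission_prob(role, head, ptype, model, alpha=1.0, unk_const=1e-3):
    """Relative frequency for known heads; phrase-type backoff for rare or unseen heads."""
    known_heads, pair_counts, class_counts, role_totals, phrase_types = model
    total = role_totals[role]
    if head in known_heads:
        return pair_counts[(role, head, ptype)] / total if total else 0.0
    # Phrasal-class probability smoothed toward a uniform prior (assumed add-alpha form).
    p_class = (class_counts[(role, ptype)] + alpha) / (total + alpha * len(phrase_types))
    return unk_const * p_class   # constant stands in for P(word | class)

toy = [("DRIVER", "Anne", "NP"), ("DRIVER", "Anne", "NP"), ("DRIVER", "farmer", "NP"),
       ("AREA", "on", "PP"), ("AREA", "on", "PP")]
model = train_emissions(toy)
p_known = emission_prob("DRIVER", "Anne", "NP", model)
p_unseen = emission_prob("DRIVER", "pilot", "NP", model)
```

Note that a head that is frequent overall but never seen with a given role still receives probability zero in this sketch, which corresponds to one of the error sources discussed in Section 4.1.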

In an actual semantic parsing application, it would not be known which constituents 
bear a role of which predicators. We could make use of a syntactic parse in determin- 
ing constituents that are candidates for roles. In a first approximation of this, we used 
a parser to determine constituents and their phrase types, and combined these with the 
FrameNet annotations. For this purpose, we restricted ourselves to training and test- 
ing on examples whose annotated predicator is a verb, since these are dealt with in a 
straightforward manner. The “sentence” level of the model in this case includes only the 
verb phrase whose head is the predicator, and its subject and arguments. If a constituent 
is identified in the parse but not in the FrameNet annotation, we label it as an Unk role. 
Again, this treatment is similar to the case of null emissions in a statistical machine 
translation model. For this format, the example above would have an additional role 
inserted at the beginning, with role=UNK and constituent=On/PP. 

3.2 Producing the Semantic Role Labels 

At inference time, the goal is to produce a sequence of role labels, given a sequence 
of constituents and a predicator. As just discussed, these constituents may be the head/ 
phrase-type pairs from the FrameNet data, or the head/phrase-type pairs that are the 
result of parsing a sentence in the corpus and extracting the verb phrase with its subject 
and arguments. The role-labeling procedure is dependent on the frame, itself a hidden 
variable at labeling time. If the frame were known, we could simply use the HMM 
Viterbi algorithm, with the roles as the hidden states and the constituent heads and their 
phrase type as the emissions. In that case, we would use transition probabilities from 
only the frame of interest. Because we currently add empty constituents for the null 
instantiated roles whether using parsing information or not, our Viterbi sequence is of 
the same length as the input constituent sequence. 
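
For the known-frame case, the decoding step is standard first-order Viterbi. The sketch below assumes a non-empty constituent sequence and takes the frame-specific transition and emission scores as two callables returning log-probabilities; these names and the toy numbers are illustrative assumptions.

```python
import math

def viterbi(constituents, roles, log_trans, log_emit, start="<s>"):
    """Return (log-probability, role sequence) of the best path for one frame.

    log_trans(prev_role, role) and log_emit(role, constituent) return
    log-probabilities, using -inf for impossible events.
    """
    # best[r] = (score, path) of the best sequence ending in role r
    best = {r: (log_trans(start, r) + log_emit(r, constituents[0]), [r]) for r in roles}
    for c in constituents[1:]:
        new_best = {}
        for r in roles:
            score, path = max(((best[p][0] + log_trans(p, r), best[p][1]) for p in roles),
                              key=lambda t: t[0])
            new_best[r] = (score + log_emit(r, c), path + [r])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

def _lg(p):
    return math.log(p) if p > 0 else -math.inf

# Toy frame with two roles (hypothetical probabilities).
T = {("<s>", "DRIVER"): 0.9, ("<s>", "VEHICLE"): 0.1,
     ("DRIVER", "DRIVER"): 0.2, ("DRIVER", "VEHICLE"): 0.8,
     ("VEHICLE", "DRIVER"): 0.5, ("VEHICLE", "VEHICLE"): 0.5}
E = {("DRIVER", "Anne/NP"): 0.7, ("DRIVER", "donkey/NP"): 0.3,
     ("VEHICLE", "Anne/NP"): 0.1, ("VEHICLE", "donkey/NP"): 0.9}

score, path = viterbi(["Anne/NP", "donkey/NP"], ["DRIVER", "VEHICLE"],
                      lambda p, r: _lg(T.get((p, r), 0.0)),
                      lambda r, c: _lg(E.get((r, c), 0.0)))
```

Because every input constituent receives exactly one role, the returned path always has the same length as the input sequence.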

For the emission probabilities, there are two options, since a particular role can ap- 
pear in multiple frames. One option is to condition the emission probabilities also on the 
frame. That is, the emission probabilities are calculated from only those role/constituent 
pairs that originally appeared in the given frame. A second option is to calculate emis- 
sion probabilities for a role over all frames in the training data, since this arguably 
would provide more evidence and mitigate sparse data problems to some extent. However,
the second option also leads to a potential problem, that of words unseen in the
given frame but seen as emissions of the role in other frames. We compare both options 
in the results. 

If the frame is not known, the more realistic case, then we have several options. We 
could just change the model and make the roles a combination of a role and a frame, 
but then the Viterbi sequence might change frames part way through, which seems 
unsatisfactory, given the intended semantics of the model. We could marginalize out 
the frame variable. In practice, given that most roles are particular to individual frames,
doing such a marginalization would probably give results little different from our current
results, but this also seems conceptually wrong, since we want to do inference for
the most likely frame and roles underlying a sentence. So instead we calculate the most
probable configuration of all the hidden variables. This generalized Viterbi algorithm is
a straightforward instance of max-propagation algorithms for Bayesian networks [13]. 

For this case, this is equivalent to the less efficient operation of simply finding all
frames with P(F|S) > 0, computing the role sequence probabilities given the transition
probabilities for that frame and the emission probabilities across all frames, and then
choosing the maximum product of the prior probability of the frame for the predicator
and the probability returned by the HMM Viterbi algorithm.
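
The enumeration over frames can be sketched as below. The per-frame decoder is passed in as a callable `decode(frame, constituents)` returning a log-probability and a role sequence; this interface, the frame names `FRAME_A` and `FRAME_B`, and the toy numbers are hypothetical stand-ins for a real per-frame Viterbi decoder.

```python
import math

def best_frame_and_roles(predicator, constituents, p_frame_given_pred, decode):
    """Maximize P(F|S) times the best role-sequence probability over all frames F."""
    best = (-math.inf, None, None)
    for frame, prior in p_frame_given_pred.get(predicator, {}).items():
        if prior <= 0:
            continue                      # only frames with P(F|S) > 0 are considered
        log_seq, roles = decode(frame, constituents)
        total = math.log(prior) + log_seq
        if total > best[0]:
            best = (total, frame, roles)
    return best

# Stub decoder standing in for per-frame Viterbi (made-up scores).
def toy_decode(frame, constituents):
    table = {"FRAME_A": (math.log(0.1), ["X"]), "FRAME_B": (math.log(0.3), ["Y"])}
    return table[frame]

priors = {"drop": {"FRAME_A": 0.6, "FRAME_B": 0.4}}
total, frame, roles = best_frame_and_roles("drop", ["they/NP"], priors, toy_decode)
```

In this toy case the frame with the lower prior wins (0.4 x 0.3 > 0.6 x 0.1), which is exactly the situation in which searching all frames differs from committing to the most probable frame for the predicator.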



4 Experimental Results 

To test the above model, we trained it on annotated FrameNet data, randomly dividing 
the data into a training set and an unseen test set. Each frame was randomly split so that 
70% of its examples were in the training set and 10% were in the test set. We report 
on three types of accuracy. First, role labeling accuracy is the number of constituents 
correctly labeled. Since we label all constituents, this makes the familiar metrics of 
recall and precision equivalent. We micro-average by adding up the number of correct 
labels for all examples and dividing by the number of total labels for all examples, so 
this is not an average accuracy per-sentence, though we have done the calculations both 
ways, and for these experiments the two figures are quite close to each other. Second, we 
report the percent of sentences for which all roles are correctly labeled, or full sentence 
accuracy. Finally, frame accuracy is calculated as the proportion of sentences for which 
the correct frame was chosen based on the predicator. 
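
The three accuracy measures can be computed as in the following sketch, which assumes that gold and predicted role sequences are aligned and of equal length for each sentence. The data layout and function name are illustrative assumptions.

```python
def evaluate(gold, pred):
    """gold, pred: aligned lists of (frame, [roles]) pairs, one pair per sentence."""
    correct = total = full = frame_ok = 0
    for (gold_frame, gold_roles), (pred_frame, pred_roles) in zip(gold, pred):
        matches = sum(g == p for g, p in zip(gold_roles, pred_roles))
        correct += matches                     # micro-averaged over all constituents
        total += len(gold_roles)
        full += matches == len(gold_roles)     # every role in the sentence is right
        frame_ok += gold_frame == pred_frame
    n = len(gold)
    return {"role": correct / total, "full": full / n, "frame": frame_ok / n}

gold = [("A", ["X", "Y"]), ("B", ["Z"])]
pred = [("A", ["X", "Z"]), ("A", ["Z"])]
scores = evaluate(gold, pred)
```

Here role accuracy pools all constituents before dividing, so long sentences weigh more than short ones; a per-sentence average would instead divide within each sentence first.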

For a baseline comparison, we computed the accuracy of a zeroth-order Markov 
model, treating all transition probabilities between roles as uniform. We also computed 
the accuracy of choosing, for all constituents, the most common role given the predi- 
cator, and the accuracy of choosing the most common role given the frame, where the 
most common frame (arg max_F P(F|S)) for the known predicator is chosen.
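
As an illustration of the simplest baseline, the sketch below labels every constituent with the role most often seen with the predicator in training. The fallback label for unseen predicators is an assumption of this sketch, as are the function names.

```python
from collections import Counter, defaultdict

def train_base_predicator(examples):
    """examples: iterable of (predicator, [roles]); returns predicator -> most common role."""
    counts = defaultdict(Counter)
    for predicator, roles in examples:
        counts[predicator].update(roles)
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def label_base_predicator(model, predicator, n_constituents, default="UNK"):
    """Assign the same role to every constituent (default for unseen predicators)."""
    return [model.get(predicator, default)] * n_constituents

train = [("rode", ["DRIVER", "VEHICLE"]), ("rode", ["DRIVER", "AREA"])]
base = train_base_predicator(train)
labels = label_base_predicator(base, "rode", 3)
```

The frame-based baseline has the same shape, with counts keyed on the most probable frame for the predicator instead of on the predicator itself.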

4.1 Results: Annotated Roles 

Our first set of experiments trained and tested our model on the correctly annotated
sentences of the FrameNet corpus, together with constituent heads as determined by
a parser. We performed most of our experiments on FrameNet I, but ran some experiments
with FrameNet II as well.³ The constituents’ heads were chosen by some simple
heuristics, but their labels correspond to the Phrase Type labels from FrameNet. These
tests are similar but not identical to the analysis in Section 4.2 of G&J.

³ We regard the FrameNet I results as broadly comparable with those of G&J, though the data
sets are not exactly the same, and there are various other differences (we guess the frame
whereas they assume it; except in parsing experiments, we use the phrasal category given in
FrameNet, whereas they always use phrasal categories returned by a parser, even when using
the constituent extent information given by FrameNet). We have recently obtained G&J’s data,
and hope to provide a more precise comparison in future work.

Table 1. FrameNet I Experimental Results. Key: Role=Role labeling accuracy, Full=full sentence
accuracy, Frame=Frame choice accuracy. Trn=Training Set, Tst=Test Set

System          Trn Role  Tst Role  Trn Full  Tst Full  Tst Frame
FirstOrder      86.1%     79.3%     75.4%     65.3%     97.5%
ZeroOrder       -         60.0%     -         34.6%     96.5%
BasePredicator  39.9%     39.2%     10.5%     10.2%     N/A
BaseFrame       37.8%     37.6%     9.2%      9.5%      N/A

Table 2. FrameNet I Arg Max versus all Sequences

System        Role   Full   Frame
First All     79.3%  65.3%  97.5%
First ArgMax  77.2%  63.2%  94.8%
Zero All      60.0%  34.6%  96.5%
Zero ArgMax   58.8%  33.4%  94.8%

The first results are on 36,805 training sentences, containing a total of 82,169 con- 
stituents, and 5299 test sentences containing 11,833 constituents. There are 78 frames, 
139 possible role labels, and 1,385 predicators. We obtain 86.1% role labeling accu- 
racy on the training data, 79.3% on the test data. For full sentence accuracy, we obtained 
75.4% accuracy on the training data and 65.3% on the test data. Finally, the correct 
frame was chosen for 98.1% of training sentences and 97.5% of the test sentences. 
Table 1 summarizes these and the remainder of our results for this data set. We did 
not measure the training accuracy in the zeroth-order case. These results are roughly 
comparable to results of 78.5% on test data for G&J’s model on data with constituents 
marked, and they cite a similar result for BasePredicator of 40.6%. We can at least 
conclude that performance is similar. 

We also measured the benefit of exploring all sequences versus only the sequence
for the frame with the highest probability given the predicator. The difference is shown
in Table 2, for test accuracy only, in the First Order and Zero Order cases. The differences
are about two percentage points in most cases.

Our next set of results are on FrameNet II, where we evaluated only the ArgMax 
case. Training on 70% and testing on 10% resulted in a corpus of 89,900 training sen- 
tences and 12,990 test sentences. Here there are 282 frames, 423 possible role labels, 
and 4,712 predicators. The performance results on the test set, shown in Table 3, are 
somewhat weaker than for FrameNet I, but not overly so, considering the increased 
number of roles and frames. 

In analysis of the role labeling results, we noticed two major sources of error. The
first is words unseen in a particular frame but not “rare” over the whole corpus. We
could partially address this with a held-out mass for unseen words that is weighted by
the prevalence of rare words of each phrase type. Second, some cases are just very
difficult. For example, prepositions commonly heading more than one type of role can
induce ambiguity, one example being Instrument/Manner ambiguity on with-marked
roles. We also have difficulties with roles in frames such as Differentiation, which
contains roles for Phenomena, Phenomenon1, and Phenomenon2, or Conversation, with its
Interlocutors, Interlocutor1, and Interlocutor2 roles. These roles are semantically
similar, and we would need a richer syntactic representation to differentiate them.

Table 3. FrameNet II Experimental Results

System      Role   Full   Frame
FirstOrder  73.9%  63.7%  88.7%
ZeroOrder   61.3%  43.0%  89.3%

Table 4. Parse Model Experimental Results

System               Trn Role  Tst Role  Trn Full  Tst Full
FirstOrder           81.0%     70.1%     58.1%     39.5%
ZeroOrder            78.8%     67.8%     50.7%     34%
BasePredicatorParse  35.4%     33.2%     1.0%      0.7%



4.2 Results: All Constituents 



In the next set of experiments, we evaluated the system, together with a parser, on the 
ability to both determine which constituents correspond to roles, and to label those 
constituents. To do so, we used our statistical parser [14] to parse only the sentences 
used in the previous section which have a verbal predicator. The parser was trained on 
Brown and about half of the Wall Street Journal. Our generative model was trained as 
described above, with the inclusion of Unk roles for constituents not corresponding to 
a labeled role. At role labeling time, the verb phrases as determined by the parser are 
presented to the model with (the labeled heads of) their subject and arguments, with 
the main verb as the predicator. The model now has the option of choosing Unk role 
labels. 

Because of the difficulty in matching parse constituents with their appropriate role 
labels in the annotated data, the size of the data set for these tests is considerably smaller 
than that above. We used only the verb phrases corresponding to known frames, but with 
the Unk roles included. There are 13,782 training examples, 1,558 test examples,
55 frames, and 980 different predicators. Also, there are 117 unique roles and 43,937 
constituents. On this task, the system obtained 81% role labeling accuracy on the train- 
ing set and 70.1% on the test set. Full sentences were considerably more difficult to get 
right, with 58.1% training accuracy and 39.5% test accuracy. Frame choice accuracy 
was 94.5% on the training data and 93.3% on the test data. These results are summa- 
rized in Table 4. The only figure G&J give for full sentence accuracy is 38% for a system 
that had to determine both which constituents correspond to roles, and what those role 
labels should be, which is again roughly comparable to our 39.5% performance on the 
test set. 
4.3 Discussion

Our model and these results can be compared and contrasted with those of G&J. Some 
of the features used by G&J are similar to those used by our model. Both models use the 
phrase type and head word of each constituent. Both models incorporate the predicator, 
but in different ways. Our model assumes the predicator is either explicitly given or 
assumes that each main verb in the sentence is a predicator. A future version could 
determine the probability that each head word is a predicator. 

In addition to these features, G&J introduce several other features. First, the Governing
Category determines, for noun phrases, whether an S or VP most closely dominates
the phrase. This feature may provide similar information to that given by our
Markov chain. Second, their Path feature follows the parse tree from the predicator to 
the constituent, represented as the string of nonterminals encountered. The final two 
features missing from our model but present in theirs are whether the main verb phrase 
of the sentence is in active or passive Voice, and the Position of the constituent, before 
or after the predicator. However, these are partially captured by linear order and phrasal 
constituent type. On the other hand, they always assume knowledge of the frame, and 
because they only labeled the roles of actual sentence constituents, their model does not 
include null instantiated roles, nor is it obvious how to extend it to do so. 

Finally, our ultimate use for this model is not just role labeling, but to estimate pa- 
rameters when the training data is only partially observed. In that case, using the max- 
imum likelihood estimate is statistically sound, whereas maximizing the conditional 
likelihood would not be, so a generative model is to be preferred.

5 Conclusion and Future Work 

We have described and evaluated a successful generative model for semantic role la- 
beling. Our results to date are encouraging but more remains to be done. While small 
improvements, such as better unknown word handling, can be made to the model, we 
also see several larger issues that need to be addressed. To do role boundary detection 
a more sophisticated model is necessary, since under some circumstances non-verbal 
predicators assign roles to syntactically non-local constituents. Also, while it is fairly 
straightforward to generalize the current model to the case of multiple predicators per 
sentence, an articulated theory of when constituents can take roles from multiple predi- 
cators is still under development in FrameNet, and would require further articulation in 
our theory. Finally, it would also be useful to incorporate some extra syntactic informa- 
tion, such as predicator position, and the presence of coordination, and to model role- 
shuffling operations such as passivization, imperative forms, and extraposition, since 
these operations, if not modeled, can obscure linguistically motivated generalizations 
about the linear order of roles. 
Abstract. This paper studies the properties and performance of models for estimating
local probability distributions which are used as components of larger
probabilistic systems — history-based generative parsing models. We report ex- 
perimental results showing that memory-based learning outperforms many com- 
monly used methods for this task (Witten-Bell, Jelinek-Mercer with fixed weights, 
decision trees, and log-linear models). However, we can connect these results 
with the commonly used general class of deleted interpolation models by showing 
that certain types of memory-based learning, including the kind that performed so 
well in our experiments, are instances of this class. In addition, we illustrate the 
divergences between joint and conditional data likelihood and accuracy perfor- 
mance achieved by such models, suggesting that smoothing based on optimizing 
accuracy directly might greatly improve performance. 



1 Introduction 

Many disambiguation tasks in Natural Language Processing are not easily tackled by 
off-the-shelf Machine Learning models. The main challenges posed are the complexity 
of classification tasks and the sparsity of data. For example, syntactic parsing of natural 
language sentences can be posed as a classification task — given a sentence s, find a 
most likely parse tree t from the set of all possible parses of s according to a grammar 
G. But the set of classes in this formulation varies across sentences and can be very 
large or even infinite. 

A common way to approach the parsing task is to learn a generative history-based
model P(s,t), which estimates the joint probability of a sentence s and a parse tree
t [2]. This model breaks the complex (s,t) pair into pieces which are sequentially
generated, assuming independence from most of the already generated structure. More
formally, the general form of the history-based parsing model is

$$P(t) = \prod_{i=1}^{n} P(y_i \mid x_i).$$

Here the parse tree is generated in some order, where every generated piece y_i (future)
is conditioned on some context x_i (history).

The most important factors in the performance of such models are (i) the chosen
generative model, including the representation of parse tree nodes, and (ii) the method
for estimating the local probability distributions needed by the model. Due to the
sparseness of NLP data, the method of estimating the local distributions P(y_i|x_i) plays
a very important role in building a good model. We will sometimes refer to this problem as
smoothing.

N. Lavrac et al. (Eds.): ECML 2003, LNAI 2837, pp. 409-420, 2003.

The goals of the paper are three-fold: (i) to empirically evaluate the accuracy
achieved by previously proposed and new local probability estimation models; (ii) to
characterize the form of a kind of memory-based model that performed best in our
study, showing its relation to deleted interpolation models; and (iii) to study the
relationship among joint and conditional likelihood, and accuracy for models of this type.

While various authors have described several smoothing methods, such as using 
a deleted interpolation model [5], or a decision tree learner [13], or a maximum en- 
tropy inspired model [3], there has been a lack of comparisons of different learning 
methods for local decisions within a composite system. Because our ultimate goal 
here is to have good classifiers for choosing trees for sentences according to the rule
t = argmax_{t'} P(s,t'), where the model P(s,t') is a product of factors given by the
local models P(y_i|x_i), one cannot isolate the estimation of local probabilities P(y_i|x_i)
as a stand-alone problem, choosing a model family and setting parameters to optimize
the likelihood of test data. The bias-variance tradeoff may be different [9]. We find 
interesting patterns in the relationship between joint and conditional data likelihood 
and accuracy performance achieved by such compound models, suggesting that heavier 
smoothing is needed to optimize accuracy and that fitting a small number of parameters 
to optimize it directly might greatly improve performance. 

The experimental study shows that memory-based learning outperforms commonly 
used methods for this task (Witten-Bell, Jelinek-Mercer with fixed weights, decision
trees, and log-linear models). For example, an error reduction of 5.8% in whole sentence
accuracy is achieved by using memory-based learning instead of Witten-Bell, which is
used in the state-of-the-art model [5].

2 Memory-Based and Deleted Interpolation Models 

In this section we demonstrate the relationship between deleted interpolation models 
and a class of memory-based models that performed best in our study. 

2.1 Deleted Interpolation Models 

Deleted interpolation models estimate the probability of a class y given a feature
vector (context) of n features, $P(y \mid x_1, \ldots, x_n)$, by linearly combining relative
frequency estimates based on subsets of the full context $(x_1, \ldots, x_n)$, using statistics
from lower-order distributions to reduce sparseness and improve the estimate. To write out
an expression for this estimate, let us introduce some notation. We will denote by $S_i$
subsets of the set $\{1, \ldots, n\}$ of feature indices. $S_i$ can take on $2^n$ values ranging
from the empty set to the full set $\{1, \ldots, n\}$. We will denote by $X_S$ the tuple of
feature values of X for the features whose indices are in S. For example,
$X_{\{1,2,3\}} = (x_1, x_2, x_3)$. For convenience, we will add another set, denoted by $*$,
which we will use to include in the interpolation the uniform distribution
$P(y) = \frac{1}{V}$, where V is the number of possible classes y. The general form of the
estimate is then:

$$P(y \mid X) = \sum_{S_i \subseteq \{1,\ldots,n\} \,\vee\, S_i = *} \lambda_{S_i}(X)\, \hat{P}(y \mid X_{S_i}) \qquad (1)$$
Here $\hat{P}$ are relative frequency estimates and $\hat{P}(y \mid X_*) = \frac{1}{V}$ by
definition. The interpolation weights $\lambda$ are shown to depend on the full context
$X = (x_1, \ldots, x_n)$ as well as the specific subset $S_i$ of features. In practice,
parameters as general as that are never estimated. For strictly linear sequences of feature
subsets, methods have been proposed to fit the parameters by maximizing the likelihood of
held-out data through EM while tying parameters for contexts having equal or similar counts.¹
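
As a concrete instance of Equation 1, the sketch below interpolates along a strictly linear sequence of feature subsets (the full context, then progressively shorter prefixes, then the empty context, then the uniform distribution) with fixed weights. It is an illustration under assumptions, not any of the estimators evaluated in this paper: the class name, the prefix ordering of features, and the rule that the weight of an unseen context is reassigned to the uniform distribution are all choices made for this example.

```python
from collections import Counter

class LinearInterpolation:
    """Fixed-weight interpolation over context prefixes plus a uniform term."""

    def __init__(self, data, n_features, classes, weights):
        # weights: one per prefix length n, n-1, ..., 0, then one for the uniform term
        assert len(weights) == n_features + 2
        assert abs(sum(weights) - 1.0) < 1e-9
        self.n, self.V, self.weights = n_features, len(classes), weights
        self.joint, self.ctx = Counter(), Counter()
        for x, y in data:
            for k in range(n_features + 1):
                prefix = tuple(x[:k])
                self.joint[(prefix, y)] += 1
                self.ctx[prefix] += 1

    def prob(self, y, x):
        p = self.weights[-1] / self.V                      # uniform term, the set *
        for w, k in zip(self.weights, range(self.n, -1, -1)):
            prefix = tuple(x[:k])
            if self.ctx[prefix]:
                p += w * self.joint[(prefix, y)] / self.ctx[prefix]
            else:
                p += w / self.V    # unseen context: weight moves to the uniform term
        return p

data = [(("a", "b"), "Y"), (("a", "b"), "N"), (("a", "c"), "Y")]
model = LinearInterpolation(data, 2, ["Y", "N"], [0.5, 0.3, 0.1, 0.1])
p_yes = model.prob("Y", ("a", "b"))
p_no = model.prob("N", ("a", "b"))
```

Moving the weight of unseen contexts to the uniform term makes the effective weights depend on the context X, as in Equation 1, and keeps the estimate normalized over classes.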



2.2 (A Kind of) Memory-Based Learning Models

We will show that a broad class of memory-based learning methods have the same 
form as Equation 1 and are thus a subclass of deleted interpolation models. While [18] 
have noted that memory-based and back-off models are similar in the way they use 
counts and in the way they specify abstraction hierarchies among context subsets, the 
exact nature of the relationship is not made precise. They emphasize the case of 1- 
nearest neighbor and show that it is equivalent to a special kind of strict back-off (non- 
interpolated) model. Our experimental results suggest that a number of neighbors K 
much larger than 1 works best for local probability estimation in parsing models. The 
exact form of the interpolation weights $\lambda$ as dependent on contexts and their counts is 
therefore crucial for combining more specific and more general evidence. We will look 
at memory-based learning models determined by the following parameters: 



- K, the number of nearest neighbors. 

- A distance function $\Delta(X, X')$ between feature vectors. This function should
depend only on the positions of matching/mismatching features.

- A weighting function $w(X, X')$, which is the weight of neighbor $X'$ of $X$. We will
assume that the weight is a function of the distance, i.e. $w(X, X') = w(\Delta(X, X'))$.



Let us denote by $N_K(X)$ the set of $K$ nearest neighbors of $X$. The probability of a
class $y$ given $X$ is estimated as:

$$\tilde{P}(y \mid X) = \frac{\sum_{X' \in N_K(X)} w(\Delta(X, X'))\, \delta(y, y')}{\sum_{X' \in N_K(X)} w(\Delta(X, X'))} \tag{2}$$



Here $y'$ is the label of the neighbor $X'$, and $\delta(y, y') = 1$ iff $y = y'$, and 0
otherwise. For nominal attributes, as always used in conditioning contexts for natural
language parsers, the distance function commonly distinguishes only between matches and
mismatches on feature values, rather than specifying a richer distance between values. We
will limit our analysis to this case as specified in the conditions above. In the majority
of applications of k-NN to natural language tasks, simple distance functions have been
used [6].²

¹ When not limited to linear subset sequences, it is possible to optimize tied parameters,
but EM is difficult to apply and we are not aware of work trying to optimize interpolation
parameters for models of this more general form.

² Richer distance functions have been proposed and shown to be advantageous [18, 12, 8].
However, such distance functions are harder to acquire, and using them significantly raises
the computational complexity of applying the k-NN algorithm. When simple distance functions
are used, clever indexing techniques make testing a constant time operation.
The distance function $\Delta(X, X')$ will take on one of $2^n$ values depending on the
indices of the matching features between the two vectors. In practice we will add $V$
artificial instances to the training set, one of each class (to avoid zero probabilities).
These instances will be at an additional distance value $\delta_{smooth}$ which will
normally be larger than the other distances. We require that the distance $\Delta(X, X')$
be no smaller than $\Delta(X, X'')$ if $X''$ matches $X$ on a superset of the attributes
on which $X'$ matches.

The commonly used overlap distance function,
$\Delta(X, X') = \sum_{i=1}^{n} w_i \,(1 - \delta(x_i, x'_i))$, i.e. the sum of the weights
of the mismatching features, satisfies these constraints. Every feature has an importance
weight $w_i > 0$. This is the distance function we have used in our experiments, but it is
more restrictive than the general case for which our analysis holds, because it has only
$n + 1$ parameters: the $w_i$ and $\delta_{smooth}$. The general case would require
$2^n + 1$ parameters.
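The estimate of Equation 2 with the overlap distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the toy feature values and the parameter `smooth_distance` are hypothetical, the inverse-cubed weighting is the INV3 function defined in Section 3.1, the artificial instances are always included regardless of $K$, and ties at the $K$-th distance are cut arbitrarily rather than schema by schema.

```python
from collections import defaultdict


def overlap_distance(x, x_prime, feature_weights):
    """Sum the importance weights w_i of the features on which the two vectors differ."""
    return sum(w for a, b, w in zip(x, x_prime, feature_weights) if a != b)


def inv3(distance):
    """Inverse-cubed weighting: (1 / (d + 1)) ** 3."""
    return (1.0 / (distance + 1.0)) ** 3


def knn_class_probabilities(x, training_set, classes, feature_weights, k,
                            smooth_distance, weight_fn=inv3):
    """Distance-weighted k-NN estimate of P(y | x) in the form of Equation 2.

    training_set is a list of (feature_tuple, label) pairs. One artificial
    instance per class is added at distance smooth_distance (an assumption of
    this sketch: it is added even if it would not be among the k nearest).
    """
    scored = sorted(
        ((overlap_distance(x, xp, feature_weights), label) for xp, label in training_set),
        key=lambda pair: pair[0],
    )
    neighbors = scored[:k]  # simplification: ties at the k-th distance are cut arbitrarily
    neighbors += [(smooth_distance, label) for label in classes]

    mass = defaultdict(float)
    for distance, label in neighbors:
        mass[label] += weight_fn(distance)
    total = sum(mass.values())
    return {label: mass[label] / total for label in classes}


# Toy data with made-up labels "A" and "B"; two nominal features per instance.
toy_training_set = [
    (("HCOMP", "left"), "A"),
    (("HCOMP", "right"), "B"),
    (("IMPER", "right"), "B"),
]
toy_probs = knn_class_probabilities(
    ("HCOMP", "left"), toy_training_set, classes=["A", "B"],
    feature_weights=[1.0, 1.0], k=2, smooth_distance=3.0,
)
```

In the toy call, the exact match contributes weight 1, the neighbor at distance 1 contributes 1/8, and each artificial instance contributes 1/64, so class "A" receives most of the probability mass while "B" stays above zero.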

We go on to introduce one last bit of notation. We will say that the schema $S$ of an
instance $X'$ with respect to an instance $X$ is the set of feature indices on which the two
instances match. (We are here using terminology similar to [18].) It is clear that the
distance $\Delta(X, X')$ depends only on the schema $S$ of $X'$ with respect to $X$. The same
holds true for the weight of $X'$ with respect to $X$. We can therefore think of the $K$
nearest neighbors as groups of neighbors that have the same schema. Let us denote by
$S_K(X)$ the set of schemata of the $K$ nearest neighbors of $X$. We assume that instances
in the same schema are either all included or all excluded from the nearest neighbors set.
The same assumption has commonly been made before [18]. We have the following
relationships between schemata: $S' < S$ if the schema $S'$ is more specific than $S$, i.e.
the set of feature indices $S'$ is a superset of the set $S$. We will use $S' \prec S$ for
immediate precedence, i.e. $S' \prec S$ iff $S' < S$ and there are no schemata between the
two in the ordering. We can rearrange Equation 2 in terms of the participating schemata and
then, after an additional re-bracketing, we obtain the same form as Equation 1:

$$\tilde{P}(y \mid X) = \sum_{S_j \in S_K(X)} \lambda_{S_j}(X)\, \hat{P}(y \mid X_{S_j}) \tag{3}$$

The interpolation coefficients have the form:

$$\lambda_{S_j}(X) = \frac{c(X_{S_j}) \left( w(\Delta(S_j)) - \sum_{S_j \prec S'_j,\; S'_j \in S_K(X)} w(\Delta(S'_j)) \right)}{Z(X)} \tag{4}$$

$$Z(X) = \sum_{S_j \in S_K(X)} c(X_{S_j}) \left( w(\Delta(S_j)) - \sum_{S_j \prec S'_j,\; S'_j \in S_K(X)} w(\Delta(S'_j)) \right) \tag{5}$$

This concludes our proof that memory-based models of this type are a subclass of 
deleted interpolation models. It is interesting to observe the form of the interpolation 
coefficients. We can notice that they depend on the total number of instances matching 
the feature subset as is usually true of other linear subsets deleted interpolation meth- 
ods such as Jelinek-Mercer smoothing and Witten-Bell smoothing. However they also 
depend on the counts of more general subsets as seen in the denominator. The different 
counts are weighted according to the function w. 

In practice the most widely used deleted interpolation models exclude some of the 
feature subsets and estimates are interpolated from a linear feature subsets order. These 
models can be represented in the form: 
$$\tilde{P}(y \mid x_1, \ldots, x_n) = \lambda_{x_1,\ldots,x_n}\, \hat{P}(y \mid x_1, \ldots, x_n) + (1 - \lambda_{x_1,\ldots,x_n})\, \tilde{P}(y \mid x_1, \ldots, x_{n-1}) \tag{6}$$

The recursion is ended with the uniform distribution as above. Memory-based models
will be subclasses of deleted interpolation models of this form if we define
$\Delta(S) = \Delta(\{1, \ldots, i\})$, where $i$ is the largest number such that
$\{1, \ldots, i\} \subseteq S$. If such $i$ does not exist, $\Delta(S) = \Delta(\{\})$,
or $\Delta(*)$ for the artificial instances.
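The recursion of Equation 6 can be unrolled from the uniform distribution upwards, as in the following minimal sketch. The names `collect_counts`, `interpolated_probability` and `weight_fn`, the toy data, and the choice to skip contexts that were never seen in training are assumptions made for illustration; the paper does not prescribe them.

```python
from collections import Counter


def collect_counts(training_set):
    """Count every prefix of each context, with and without the class label."""
    counts, context_counts = Counter(), Counter()
    for context, y in training_set:
        for i in range(len(context) + 1):
            prefix = tuple(context[:i])
            counts[(prefix, y)] += 1
            context_counts[prefix] += 1
    return counts, context_counts


def interpolated_probability(y, context, counts, context_counts, num_classes, weight_fn):
    """Linear deleted interpolation in the form of Equation 6.

    weight_fn(prefix) returns the weight of the higher-order (more specific)
    relative frequency estimate; the remainder goes to the lower-order estimate.
    """
    estimate = 1.0 / num_classes              # the recursion ends with the uniform distribution
    for i in range(len(context) + 1):         # from the empty context up to the full context
        prefix = tuple(context[:i])
        total = context_counts.get(prefix, 0)
        if total == 0:
            continue                          # assumption: unseen contexts keep the lower-order estimate
        relative_frequency = counts.get((prefix, y), 0) / total
        lam = weight_fn(prefix)
        estimate = lam * relative_frequency + (1.0 - lam) * estimate
    return estimate


# Toy data with made-up class labels "e1" and "e2".
toy_data = [
    (("HCOMP", "verb"), "e1"),
    (("HCOMP", "verb"), "e1"),
    (("HCOMP", "noun"), "e2"),
    (("IMPER", "verb"), "e2"),
]
toy_counts, toy_context_counts = collect_counts(toy_data)


def fixed_weight(prefix):
    """A constant higher-order weight of 0.21, i.e. 0.79 on the lower-order model."""
    return 0.21


p_e1 = interpolated_probability("e1", ("HCOMP", "verb"), toy_counts, toy_context_counts, 2, fixed_weight)
p_e2 = interpolated_probability("e2", ("HCOMP", "verb"), toy_counts, toy_context_counts, 2, fixed_weight)
```

Because every step mixes two distributions that each sum to one, the interpolated estimates over all classes also sum to one, whatever weights `weight_fn` returns in [0, 1].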

3 Experiments 

We investigate these ideas via experiments in probabilistic parse selection from among 
a set of alternatives licensed by a hand-built grammar in the context of the newly devel- 
oped Redwoods HPSG treebank [14]. HPSG (Head-driven Phrase Structure Grammar) is 
a modern constraint-based lexicalist (unification) grammar, described in [15]. 

The Redwoods treebank makes available syntactic and semantic analyses of much 
greater depth than, for example, the Penn Treebank. Therefore there are a large number 
of features available that could be used by stochastic models for disambiguation. In 
the present experiments, we train generative history-based models for derivation trees. 
The derivation trees are labeled via the derivation rules that build them up; an example 
is shown in Figure 1. All models use the 8 features shown in Figure 2. They estimate 
the probability $P(expansion(n) \mid history(n))$, where $expansion$ is the tuple of node 
labels of the children of the current node and $history$ is the 8-tuple of feature values. 
The results we obtain should be applicable to Penn Treebank parsing as well, since we 
use many similar features such as grand-parent information and build similar generative 
models. 

The accuracy results we report are averaged over a ten-fold cross-validation on the 
data set summarized in Table 1. Accuracy results denote the percentage of test sentences 
for which the highest ranked analysis was the correct one. This measure scores whole 
sentence accuracy and is therefore stricter than the labelled precision/recall measures, 
and more appropriate for the task of parse selection.³ 



Table 1. Annotated corpus used in experiments: The columns are, from left to right, the total 
number of sentences, average length, average lexical ambiguity (number of lexical entries per 
token), average structural ambiguity (number of parses per sentence), and the accuracy of 
choosing at random 

sentences  length  lex ambiguity  struct ambiguity  random baseline
5312       7.0     4.1            8.3               25.81%



3.1 Linear Feature Subsets Order 

³ Therefore we should expect to obtain lower figures for this measure compared to labelled 
precision/recall. As an example, the state-of-the-art unlexicalized parser [11] achieves 86.9% 
F measure on labelled constituents and 30.9% exact match accuracy. 

IMPER
  HCOMP
    HCOMP
      LET_V1  Let
      US      us
    SEE_V3  see

Fig. 1. Example of a Derivation Tree 

No.  Name                     Example
1    Node Label               HCOMP
2    Parent Node Label        HCOMP
3    Node Direction           left
4    Parent Node Direction    none
5    Grandparent Node Label   IMPER
6    Great Grandparent Label  yes
7    Left Sister Node Label   HCOMP
8    Category of Node         verb

Fig. 2. Features over derivation trees 

In this first set of experiments, we compare memory-based learning models restricted 
to linear order among feature subsets to deleted interpolation models using the same 
linear subsets order. The linear interpolation sequence was the same for all models 
and was determined by ordering the features of the history by gain-ratio. The resulting 
order was: 1, 8, 2, 3, 5, 4, 7, 6 (see Figure 2). Numerous methods have been proposed for 
estimation of parameters for linearly interpolated models.⁴ In this section we survey the 
following models: 

Jelinek-Mercer with a fixed interpolation weight $\lambda$ for the lower-order model (and 
$1 - \lambda$ for the higher-order model). This is a model of the form of Equation 6, where the 
interpolation weights do not depend on the feature history. We report test set accuracy 
for varying values of $\lambda$. We refer to this model as JM. 

Witten-Bell smoothing [17] uses as an expression for the weights: 
$\lambda(x_1, \ldots, x_i) = \frac{c(x_1, \ldots, x_i)}{c(x_1, \ldots, x_i) + d \cdot |\{y : c(y, x_1, \ldots, x_i) > 0\}|}$. 
We refer to this model as WBd. The original Witten-Bell smoothing⁵ is the special case with 
$d = 1$, but the use of an additional parameter $d$, which multiplies the number of observed 
outcomes in the denominator, is common in some of the best-performing parsers and 
named-entity recognizers [1,5]. 

Memory-based models restricted to linear sequence, with varying weight function 
and varying values of $K$. The restriction to linear sequence is obtained by defining 
the distance function to be of the special form described at the end of section 2. We 
define the distance function as follows for subsets of the linear generalization sequence: 
$\Delta(\{1, \ldots, n\}) = 0, \ldots, \Delta(\{\}) = n, \Delta(*) = n + 1$. We implemented 
several weighting methods, including inverse, exponential, information gain, and gain-ratio. 
The weight functions inverse cubed (INV3) and inverse to the fourth (INV4) worked best. They 
are defined as follows: $INV3(d) = (1/(d + 1))^3$, $INV4(d) = (1/(d + 1))^4$. We refer to these 
models as LKNN3 and LKNN4, standing for linear k-NN using weighting function INV3 and 
linear k-NN using weighting function INV4 respectively. 
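The weighting schemes described above are simple enough to write down directly. The sketch below is illustrative only: the function names are hypothetical, `linear_distance` reads the elided middle of the distance definition as $\Delta(\{1, \ldots, i\}) = n - i$, and returning a zero weight for an unseen context is an assumption of this sketch.

```python
def witten_bell_weight(context_count, distinct_outcomes, d=1.0):
    """WBd weight of the higher-order estimate: c / (c + d * number of observed outcomes)."""
    if context_count == 0:
        return 0.0  # assumption: an unseen context contributes nothing
    return context_count / (context_count + d * distinct_outcomes)


def linear_distance(num_matching_prefix_features, n):
    """Distance along the linear sequence: 0 for a full match, n for the empty schema."""
    return n - num_matching_prefix_features


def inverse_power_weight(distance, power):
    """INV3 (power=3) and INV4 (power=4): (1 / (d + 1)) ** power."""
    return (1.0 / (distance + 1.0)) ** power


# A context seen 10 times with 2 distinct outcomes:
wb_original = witten_bell_weight(10, 2, d=1.0)   # original Witten-Bell, d = 1
wb_heavy = witten_bell_weight(10, 2, d=20.0)     # d = 20 shifts weight to the lower-order model
```

For this example context, raising $d$ from 1 to 20 lowers the higher-order weight from 10/12 to 10/50, which is the sense in which a larger $d$ means heavier smoothing.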

Figure 3 shows parse selection accuracy for the three models discussed above: 
JM in (a), WBd in (b), and LKNN3 and LKNN4 in (c). We can note that the maximal 
performance of JM (79.14% at $\lambda = .79$) is similar to the maximal performance 
of WBd (79.60% at $d = 20$). The best accuracies are achieved when the smoothing 
is much heavier than we would expect. For example, one would think that the higher 
order distributions should normally receive more weight, i.e. $\lambda < .5$ for JM. Similarly, 
for Witten-Bell smoothing, the value of $d$ achieving maximal performance is larger than 
expected. WB is an instance of WBd and we see that it does not achieve good accuracy. 
[5] reports that values of $d$ between 2 and 5 were best. The over-smoothing issue 
is related to our observations on the connection between joint likelihood, conditional 
likelihood, and parse selection accuracy, which we will discuss at length in Section 4. 

Fig. 3. Linear Subsets Deleted Interpolation Models: Jelinek-Mercer (JM) (a), Witten-Bell (b), 
and k-NN (c) 

⁴ In addition to models of the form of Equation 6 there are models that use modified distributions 
(not the relative frequency). Comparison to these other good models (e.g., forms of Kneser-Ney 
and Katz smoothing [4]) is not the subject of this study and would be an interesting topic 
for future research. 

⁵ Method C, also used in [4]. 

The best performance of LKNN3 is 79.94% at $K = 3{,}000$ and the best performance 
of LKNN4 is 80.18% at $K = 15{,}000$. Here we also note that much higher values of $K$ 
are worth considering. In particular, the commonly used $K = 1$ (74.07% for LKNN4) 
performs much worse than optimal. The difference between LKNN4 at $K = 15{,}000$ 
and JM at $\lambda = 0.79$ is statistically significant according to a two-sided paired t-test at 
level $\alpha = .05$ (p-value = .024). The difference between LKNN4 and the best accuracy 
achieved by WBd is not significant according to this test, but the accuracy of LKNN4 
is more stable across a broad range of $K$ values and thus the maximum can be more 
easily found when fitting on held-out data. 

We saw that using k-NN to estimate interpolation weights in a strict linear interpo- 
lation sequence works better than JM and WBd. The real advantage of k-NN, however, 
can be seen when we want to combine estimates from more general feature contexts but 
do not limit ourselves to strict linear deleted interpolation sequences. The next section 
compares k-NN in this setting to other proposed alternatives. 

3.2 General k-NN, Decision Trees, and Log-Linear Models 

In this second group of experiments we study the behavior of memory-based learning 
not restricted to linear subset sequences, using different weighting schemes and number 
of neighbors, comparing this result to the performance of decision trees and log-linear 
models. The next paragraphs describe our implementation of these models in more 
detail. 

For k-NN we define the distance metric as follows: $\Delta(\{i_1, \ldots, i_k\}) = n - k$; 
$\Delta(\{\}) = n$, $\Delta(*) = n + 1$. We report the performance of inverse weighting cubed 
(INV3) and inverse weighting to the fourth (INV4) as for linear k-NN. We refer to these 
two models as KNN3 and KNN4 respectively. 
Table 2. Best Parse Selection Accuracies Achieved by Models 

Model     KNN4    DecTreeWBd  LogLinSingle  LogLinPairs  LogLinBackoff
Accuracy  80.79%  79.66%      78.65%        78.91%       77.52%


Decision trees have been used previously to estimate probabilities for statistical 
parsers [13].⁶ We found that smoothing the probability estimates at the leaves by linear 
interpolation with estimates along the path to the root improved the results significantly, 
as reported in [13]. We used WBd and obtained final estimates by linearly interpolating 
the distributions from the leaf up to the root with the uniform distribution. We can think 
of this as having a different linear subset sequence for every leaf. The obtained model 
is thus an instance of a deleted interpolation model ([13]). We denote this model as 
DecTreeWBd. 

Log-linear models have been successfully applied to many natural language problems, 
including conditional history-based parsing models [16], part-of-speech tagging, 
PP attachment, etc. In [3], the use of a “maximum entropy inspired” estimation technique 
leads to the currently best performing parser.⁷ The space of possible maximum 
entropy models that one could build is very large. In our implementation here, we 
are using only binary features over the history and expansion of the following form: 
$f_{v_{i_1}, \ldots, v_{i_k}, expansion}(x'_1, \ldots, x'_n, expansion') = 1$ iff $expansion' = expansion$ and 
$x'_{i_1} = v_{i_1}, \ldots, x'_{i_k} = v_{i_k}$. Gaussian smoothing was used by all models. We trained three 
models differing in the type of allowable features (templates). 

- Single attributes only. This model has the fewest number of features. Here we allow 
the features to be defined by specifying a value for a single attribute for a history. 
We denote this model LogLinSingle. 

- This model includes features looking at the values of pairs of attributes. However, 
we did not allow all pairs of attributes, but only pairs including attribute number 1 
(the node label). These are a total of 8 pairs (including the singleton set containing 
only attribute 1). We denote this model LogLinPairs. 

- This final model mimics the linear feature subsets deleted interpolation models of 
section 3.1. It uses all subsets in the linear sequence, which makes for 9 subsets. 
We denote this model LogLinBackoff. 
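The binary feature templates above can be illustrated with a small closure. This is a hedged sketch rather than the code used for the experiments: `make_feature`, the 0-based attribute positions and the example expansion are hypothetical, and the history values are simply the example column of Figure 2.

```python
def make_feature(attribute_values, expansion):
    """Build a binary feature over (history, candidate expansion).

    attribute_values maps 0-based attribute positions to required values. The
    feature fires only if the candidate expansion equals `expansion` and the
    history agrees with every required attribute value.
    """
    def feature(history, candidate_expansion):
        if candidate_expansion != expansion:
            return 0
        return int(all(history[i] == v for i, v in attribute_values.items()))
    return feature


# Example history taken from the "Example" column of Figure 2.
example_history = ("HCOMP", "HCOMP", "left", "none", "IMPER", "yes", "HCOMP", "verb")
example_expansion = ("HCOMP", "SEE_V3")   # illustrative expansion, not taken from the data

# LogLinSingle-style feature: one attribute (the node label).
single = make_feature({0: "HCOMP"}, example_expansion)
# LogLinPairs-style feature: the node label paired with one other attribute.
pair = make_feature({0: "HCOMP", 7: "verb"}, example_expansion)
# A pair feature whose second value does not match the history.
mismatch = make_feature({0: "HCOMP", 7: "noun"}, example_expansion)
```

Under this reading, LogLinBackoff would use one such template per subset in the linear sequence, each requiring values for a longer prefix of the gain-ratio ordering.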

Figure 4 shows the performance of k-NN using the two inverse weighting functions 
for varying values of $K$ and DecTreeWBd for varying values of $d$. Table 2 shows the 
best results achieved by KNN4, DecTreeWBd and the three log-linear models. 

⁶ We induce decision trees using gain-ratio as a splitting criterion (information gain divided by 
the entropy of the attribute). We stopped growing the tree when all samples in a leaf had the 
same class, or when the gain ratio was 0. 

⁷ In [3], estimates based on different feature subsets are multiplied and the model has a form 
similar to that of a log-linear model. The obtained distributions are not normalized but are 
close to summing to one. 

Fig. 4. k-NN using INV3 and INV4 (a) and DecTreeWBd (b) 

The results suggest that memory-based models perform better than decision trees 
and log-linear models in combining information for probability estimation. The difference 
between the accuracy of KNN4 and WBd is a 5.8% error reduction and is statistically 
significant at level $\alpha = 0.01$ according to a two-sided paired t-test 
(p-value = 0.0016). 
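The 5.8% figure can be checked from the reported accuracies, assuming that WBd here refers to the best linear WBd accuracy of 79.60% from Section 3.1, that KNN4 refers to the 80.79% of Table 2, and that the error is 100% minus the accuracy:

```latex
\[
\frac{(100 - 79.60) - (100 - 80.79)}{100 - 79.60}
  = \frac{20.40 - 19.21}{20.40}
  = \frac{1.19}{20.40}
  \approx 0.058
\]
```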

This result agrees with the observation in [7] that memory-based models should be 
good for NLP data, which is abundant with exceptions and special cases. The study 
in [7] is restricted to the classification case and K = 1 or other very small values of 
K are used. Here we have shown that these models work particularly well for proba- 
bility estimation. Relative frequency estimates from different schemata are effectively 
weighted based on counts of feature subsets and distance-weighting. It is especially 
surprising that these models performed better than log-linear models. Log-linear/logistic 
regression models are the standardly promoted statistical tool for these sorts of nominal 
problems, but actually we find that simple memory-based models performed better. The 
log-linear models we have surveyed perform more abstraction (by just including some 
features) and are less easily controllable for overfitting; abstracting away information is 
not expected to work well for natural language according to [7]. 

4 Log-Likelihoods and Accuracy 

Our discussion up to now included only parse selection results. But what is the rela- 
tion to the joint likelihood of test data (likelihood according to a model of the correct 
parses) or the conditional likelihood (the likelihood of the correct parse given the sen- 
tence)? Work in smoothing for language models optimizes parameters on held-out data 
to maximize the joint likelihood, and measures test set performance by looking at per- 
plexity (which is a monotonic function of the joint likelihood) [4]. Results on word 
error rate for speech recognition are also often reported [10], but the training process 
does not specifically try to minimize word error rate (because it is hard). In our exper- 
iments we observe that much heavier smoothing is needed to maximize accuracy than 
to maximize joint log-likelihood. 
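To make the three quantities concrete, the sketch below computes them for a single sentence from the joint probabilities a model assigns to its candidate parses. The function name and all the numbers are hypothetical, and the sketch scores one sentence at a time, whereas the paper averages the joint log-likelihood per expansion and the conditional log-likelihood per sentence.

```python
import math


def parse_selection_scores(parse_probabilities, correct_index):
    """Return (joint log-likelihood, conditional log-likelihood, is_correct) for one sentence.

    parse_probabilities holds the model's joint probability for each candidate
    parse of the sentence; correct_index marks the correct parse.
    """
    p_correct = parse_probabilities[correct_index]
    joint_ll = math.log(p_correct)
    conditional_ll = math.log(p_correct / sum(parse_probabilities))
    best_index = max(range(len(parse_probabilities)), key=parse_probabilities.__getitem__)
    return joint_ll, conditional_ll, best_index == correct_index


# Two made-up models scoring the same sentence; the correct parse is at index 0.
model_a = parse_selection_scores([1e-6, 2e-6], correct_index=0)   # higher joint likelihood, wrong choice
model_b = parse_selection_scores([1e-8, 1e-9], correct_index=0)   # lower joint likelihood, right choice
```

In this constructed example the first model gives the correct parse a higher joint likelihood yet ranks a wrong parse first, while the second model does the opposite, which is the kind of divergence between likelihood and accuracy that this section examines.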

We show graphs for the model JM for the joint log-likelihood averaged per expansion, 
and the conditional log-likelihood averaged per sentence in Figure 5 (a). The 
corresponding accuracy curve is shown in Figure 3 (a). The graph in Figure 5 (b) shows joint 
and conditional log-likelihood curves for model KNN4; its accuracy curve is in Figure 
4 (a). 

Fig. 5. JM (a) and k-NN using INV4 (b) 



The pattern of the points of maximum for the test data joint log-likelihood, condi- 
tional log-likelihood and parse selection accuracy is fairly consistent across smoothing 
methods. The joint likelihood increased in the beginning with smoothing up to a point, 
and then started to decrease. The accuracy followed the pattern of the joint likelihood, 
but the peak performance was reached long after the best settings for joint likelihood 
(and before the best settings for conditional likelihood). This relationship between the 
maxima (first the joint log-likelihood maximum, followed by the accuracy maximum) holds for 
all surveyed models. This phenomenon could be partly explained by reference to the increased 
significance of the variance in classification problems [9]. Smoothing reduces 
the variance of the estimated probabilities. In models of the kind we study here, where 
many local probabilities are multiplied to obtain a final probability estimate, assuming 
independence between model sub-parts, the bias-variance tradeoff may be different and 
over-smoothing even more beneficial. There exist smoothing methods that would give 
very bad joint likelihood but still good classification as long as the estimates are on the 
right side of the decision boundary. We can also note that, for the models we surveyed, 
achieving the highest joint likelihood did not translate to being the best in accuracy. For 
example, the best joint log-likelihood was achieved by DecTreeWBd followed very 
closely by WBd. The joint log-likelihood achieved by linear k-NN was worse and the 
worst was achieved by general k-NN (which performed best in accuracy). Therefore fit- 
ting a small number of parameters for a model class to optimize validation set accuracy 
is worth it for choosing the best model. 

Another interesting phenomenon is that the conditional log-likelihood continued 
to increase with smoothing and the maximum was reached at the heaviest amount of 
smoothing for almost all surveyed models — JM, WBd, DecTreeWBd, and KNN3. 
For the other forms of k-NN the conditional log-likelihood curve was more wiggly and 
peaked at several points going up and down. We explain this increase in conditional 
log-likelihood with heavy smoothing by the tendency of such product models to be 
over-confident. Whether they are wrong or right, the conditional probability of the best 
parse is usually very close to 1. The conditional log-likelihood can thus be improved by 
making the model less confident. Additional gains are possible even long after the best 
smoothing amount for accuracy. 
Fig. 6. PP Attachment Task: Jelinek-Mercer with Fixed Weight $\lambda$ for the higher order model (a) 
and Witten-Bell WBd for varying $d$ (b) 



One could think that this phenomenon may be specific to our task: selection of the 
best parse from a set of possible analyses, and not from all parses to which the model 
would assign non-zero probability. To further test the relationship between likelihoods 
and accuracy, we performed additional experiments on a different domain. The task is 
PP (prepositional phrase) attachment given only the four words involved in the dependency, 
$v, n_1, p, n_2$, e.g. eat salad with fork. The attachment of the PP 
$p, n_2$ is either to the preceding noun $n_1$ or to the verb $v$. We tested a generative model 
for the joint probability $P(Att, V, N_1, P, N_2)$, where $Att$ is the attachment and can 
be either noun or verb. We graphed the likelihoods and accuracy achieved when using 
Jelinek-Mercer with fixed weight and Witten-Bell with varying parameter $d$, as for the 
parsing experiments. Figure 6 shows curves of accuracy, (scaled) joint log-likelihood 
and conditional log-likelihood. We see that the pattern described above repeats. 

5 Summary and Future Work 

The problem of effectively estimating local probability distributions for compound deci- 
sion models used for classification is surprisingly unexplored. We empirically compared 
several commonly used models to memory-based learning and showed that memory- 
based learning achieved superior performance. The added flexibility of an interpola- 
tion sequence not limited to a linear feature sets generalization order paid off for the 
task of building generative parsing models. Further research is needed on the 
performance of memory-based models, such as comparing them to Kneser-Ney and Katz 
smoothing and fitting the k-NN weights on held-out data. 

Our experimental study of the relationship among joint and conditional likelihood, 
and classification accuracy conveyed interesting regularities for such models. A more 
theoretical quantification of the effect of the bias and variance of the local distributions 
on the overall system performance is a subject of future research. 
Abstract. Modeling learning agents in the context of Multi-agent Systems 
requires an adequate understanding of their dynamic behaviour. 
Evolutionary Game Theory provides a dynamics which describes how 
strategies evolve over time. Borgers et al. [1] and Tuyls et al. [11] have 
shown how classical Reinforcement Learning (RL) techniques such as 
Cross-learning and Q-learning relate to the Replicator Dynamics (RD). 
This provides a better understanding of the learning process. In this 
paper, we introduce an extension of the Replicator Dynamics from Evo- 
lutionary Game Theory. Based on this new dynamics, a Reinforcement 
Learning algorithm is developed that attains a stable Nash equilibrium 
for all types of games. Such an algorithm is lacking for the moment. 
This kind of dynamics opens an interesting perspective for introducing 
new Reinforcement Learning algorithms in multi-state games and Multi- 
Agent Systems. 



1 Introduction 

In this paper a new RL algorithm, based on the Replicator Dynamics (RD) from 
Evolutionary Game Theory (EGT), is introduced. Several authors have already 
noticed and proved that the RD can emerge from several RL schemes. Borgers et 
al. [1] have shown that the RD can emerge from Cross learning and the authors 
[10,11] have shown that the RD emerge from Learning Automata and Boltz- 
mann Q-learning. This emergence offers a lot of advantages. For instance, these 
evolutionary dynamics open a new perspective in understanding and fine tuning 
the learning process in games and more generally in Multi-Agent Systems (MAS). 
Learning can be very time consuming, especially when you need to fine tune 
some parameters. As the experiments in Tuyls et al. illustrate [10,11], plotting 
the direction field of the RD beforehand in one-state games gives information 
on how to initialize the learning agents so that they end up in the most in- 
teresting attractors. As these previous results show, convergence is not always 
guaranteed for some particular kind of games. For this reason we adapted the 
RD to an extended evolutionary dynamics which describes the desired behaviour 
from the Reinforcement Learning (RL) agents. After this, the accompanying RL 
algorithm of these Extended Replicator Dynamics (ERD) will be developed. 

The outline of the paper is as follows. In section 2 we elaborate on the RD 
from EGT and on the important connection with RL, more specifically the Cross 
learning model. Section 3 describes how the RD are extended to satisfy the need 
of convergence to certain attractors in certain games. After this we describe 
the new RL algorithm matching this ERD. Section 4 describes the experiments, 
confirming the results from section 3. Finally, we end with a conclusion. 

2 Selection Dynamics and Cross Learning 

In this section we elaborate on an important result of Borgers and Sarin [1]. 
They showed that in an appropriately constructed time limit, the Cross learning 
model converges to the Replicator Equations. Also it is shown [10,11] how the RD 
emerge from Learning Automata and Boltzmann Q-learning. The perspective of 
evolutionary dynamics in reinforcement learning offers a lot of advantages. First, 
it becomes possible to understand RL in terms of evolutionary dynamics, i.e. 
selection and mutation mechanisms. Second, it allows the learning process to be 
fine-tuned in advance. In this paper a new RL algorithm will be developed based 
on the replicator equations, which behaves as one desires. This means that the 
learning process is guaranteed to converge to a stable Nash equilibrium in all 
types of one-state games. In a first subsection we briefly explain the RD, as we 
will alter these dynamics in section 3. After this the Cross learning model will be 
explained, as this model will serve as a basis for a new RL algorithm in section 3. 

2.1 The Replicator Equations 

The basic concepts and techniques developed in EGT were initially formulated 
in the context of evolutionary biology [13,9]. In this context, the strategies of 
all the players are genetically encoded (called genotype). Each genotype refers 
to a particular behavior which is used to calculate the payoff of the player. The 
payoff of each player’s genotype is determined by the frequency of other player 
types in the environment. 

One way in which EGT proceeds is by constructing a dynamic process in 
which the proportions of various strategies in a population evolve. Examining 
the expected value of this process gives an approximation which is called the 
RD. Simply stated, an abstraction of an evolutionary process usually combines 
two basic elements: selection and mutation. Selection favors some varieties over 
others, while mutation provides variety in the population. The RD highlight the role 
of selection: they describe how systems consisting of different strategies change over 
time. They are formalized as a system of differential equations. Each replicator 
(or genotype) represents one (pure) strategy. This strategy is inherited by all 
the offspring of that replicator. The general form of a replicator dynamic is the 
following: 
$$\frac{dx_i}{dt} = [(Ax)_i - x \cdot Ax]\, x_i \tag{1}$$

In equation (1), $x_i$ represents the density of strategy $i$ in the population, and $A$ 
is the payoff matrix which describes the different payoff values each individual 
replicator receives when interacting with other replicators in the population. 
The state of the population ($x$) can be described as a probability vector 
$x = (x_1, x_2, \ldots, x_J)$ which expresses the different densities of all the different types 
of replicators in the population. Hence $(Ax)_i$ is the payoff which replicator $i$ 
receives in a population with state $x$, and $x \cdot Ax$ describes the average payoff in 
the population. The growth rate $\frac{dx_i}{dt} / x_i$ of the population share using strategy $i$ 
equals the difference between the strategy’s current payoff and the average payoff 
in the population. For further information we refer the reader to [13,2]. 
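One consequence of equation (1) is worth writing out: as long as $x$ is a probability vector, the population shares keep summing to one. Summing the right-hand side over all strategies, and using only $\sum_i x_i = 1$ and $\sum_i x_i (Ax)_i = x \cdot Ax$, gives:

```latex
\[
\sum_i \frac{dx_i}{dt}
  = \sum_i x_i (Ax)_i - (x \cdot Ax) \sum_i x_i
  = x \cdot Ax - x \cdot Ax
  = 0
\]
```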

In this paper the players are reinforcement learners. We consider a game to 
be played between the members of two different populations, each population 
representing one reinforcement learner. As a result, we need two systems of 
differential equations: one for the row player ($P$) and one for the column player 
($Q$). This setup corresponds to an RD for asymmetric games. If $A = B^T$, equation 
(1) would again emerge. 

This translates into the following replicator equations for the two populations: 

$$\frac{dp_i}{dt} = [(Aq)_i - p \cdot Aq]\, p_i \tag{2}$$

$$\frac{dq_i}{dt} = [(Bp)_i - q \cdot Bp]\, q_i \tag{3}$$



As can be seen in equations (2) and (3), the growth rate of the types in 
each population is now determined by the composition of the other population. 
Note that, when calculating the rate of change using these systems of differential 
equations, two different payoff matrices (A and B) are used for the two different 
players. 
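
As an illustration, equations (2) and (3) can be integrated numerically with a simple explicit Euler scheme. The sketch below is not part of the original paper: the function name `replicator_step`, the step size `dt` and the Euler scheme itself are assumptions made for illustration. It follows equation (3) literally, so `B` is indexed by the column player's own action first; a matrix written in the bimatrix layout (row action first) would need to be transposed before use.

```python
def replicator_step(p, q, A, B, dt):
    """One explicit Euler step of equations (2) and (3).

    p, q: mixed strategies (lists summing to 1) of the row and column player.
    A[i][j]: row player's payoff for own action i against column action j.
    B[j][i]: column player's payoff for own action j against row action i
             (ASSUMPTION: indexing chosen to match (Bp)_j in equation (3)).
    dt: hypothetical integration step size.
    """
    Aq = [sum(A[i][j] * q[j] for j in range(len(q))) for i in range(len(p))]
    Bp = [sum(B[j][i] * p[i] for i in range(len(p))) for j in range(len(q))]
    avg_p = sum(p[i] * Aq[i] for i in range(len(p)))  # p . Aq
    avg_q = sum(q[j] * Bp[j] for j in range(len(q)))  # q . Bp
    new_p = [p[i] + dt * (Aq[i] - avg_p) * p[i] for i in range(len(p))]
    new_q = [q[j] + dt * (Bp[j] - avg_q) * q[j] for j in range(len(q))]
    return new_p, new_q
```

Because the increments of each population sum to zero, the Euler step keeps both vectors on the probability simplex up to rounding error.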



2.2 The Cross Learning model 

The cross learning model is a special case of the standard reinforcement learning 
model [1] . The model considers several agents playing the same normal form 
game repeatedly in discrete time. At each point in time, each player is charac- 
terized by a probability distribution over her strategy set which indicates how 
likely she is to play any of her strategies. At each time step (indexed by n) , a 
player chooses one of her strategies based on the probabilities which are related 
to each isolated strategy. As a result a player can be represented by a probability 
vector: 

$p(n) = (p_1(n), \ldots, p_r(n))$

In case of a 2-player game with payoff matrix U, player k ($k \in \{1,2\}$) gets 
payoff $U^k_{ij}$ when player 1 chooses strategy i and player 2 chooses strategy j. 
Players do not observe each other's strategies and payoffs; they are uninformed 
players. After each stage they update their probability vector according to 



$$p_i(n+1) = U_{ij} + (1 - U_{ij})\, p_i(n) \qquad (4)$$

$$p_{i'}(n+1) = (1 - U_{ij})\, p_{i'}(n) \qquad (5)$$



where $0 < U_{ij} < 1$. Equation (4) expresses how the probability of the selected 
strategy (i) is updated and equation (5) expresses how all the other strategies 
$i' \neq i$ are adjusted. The probability vector q(n) is updated in an analogous 
manner. This entire system of equations defines a stochastic update process for 
the players $\{p^k(n)\}$. This process is called the "Cross learning process" in [1]. 
Börgers and Sarin showed that in an appropriately constructed continuous time 
limit, this model converges to the asymmetric, continuous time version of the 
replicator dynamics, see section 2. 
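
The update of equations (4) and (5) is short enough to state directly in code. The following is a minimal sketch, assuming payoffs already scaled into the open interval (0, 1); the function names `cross_update` and `sample_action` are illustrative and do not come from the paper.

```python
import random

def cross_update(p, chosen, payoff):
    """Apply equations (4) and (5) to one player's probability vector.

    chosen: index i of the strategy that was just played.
    payoff: the received payoff U_ij, assumed to lie between 0 and 1.
    """
    return [payoff + (1 - payoff) * pk if k == chosen else (1 - payoff) * pk
            for k, pk in enumerate(p)]

def sample_action(p, rng):
    """Draw a strategy index according to the probability vector p."""
    return rng.choices(range(len(p)), weights=p)[0]
```

The selected strategy gains exactly the probability mass that the other strategies lose, so the vector stays normalized without an explicit renormalization step.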



3 Extending the Replicator Equations 
and the Cross Learning Model 

The reasons for changing the RD and looking for a new dynamics become clear 
from [10,11]. In one-state games it is impossible for Cross learning and Learning 
Automata to guarantee convergence to a stable Nash equilibrium in all types of 
games. In Boltzmann Q-learning a Nash equilibrium can be attained, but there 
is no guarantee for stability. If a dynamical system can be found that offers 
these guarantees, we can construct a reinforcement learning algorithm in an 
analogous manner with the same behaviour of this adapted dynamical system. 
This makes the approach of replicator equations very interesting and promising 
toward multi-state games and Multi-Agent Systems. 

In the first subsection we will alter the traditional replicator equations in such 
a manner that in all classes of games the players will converge to a particular 
Nash equilibrium (see section 4). These new equations are referred to as the 
Extended Replicator Dynamics (ERD). In the second subsection we will present 
the accompanying learning algorithm of the changed dynamics based on the 
Cross learning model. 



3.1 Developing an Extended Replicator Dynamics 

When constructing an altered selection dynamics, we take the replicator dynamics 
and its interpretation as a starting point. In replicator dynamics, the 
probabilities a player has over its strategies are changed greedily with respect 
to payoff in the present. In this section a method is shown to change these 
probabilities over strategies not only with respect to payoff growth in the present but 
also to payoff growth in the future. We call those players that act so as to optimize 
future payoff extended Cross learners and the class of dynamics associated 
extended dynamics. 

There are of course different ways to build such extended players. The most 
obvious is to use a linear approximation of the evolution of fitness in time. This 
is the approach we use here. 

For the ERD we compose the following equation f, 

$$f(x) = RD(x) + \eta\, \frac{dRD(x)}{dt} \qquad (6)$$

where RD(x) is 

$$\frac{dx_i}{dt} = [(Ax)_i - x \cdot Ax]\, x_i \qquad (7)$$

and $\eta$ is the parameter that determines how far in the future we need to look. 

The composition of equation (6) can best be understood as follows. When using 
the classical replicator equations (i.e. RD(x)), we act greedily toward payoff in 
the present. When adding our second term, 

$$\eta\, \frac{dRD(x)}{dt} \qquad (8)$$

we act greedily toward payoff in the future. From an analytical point of view, 
the second term gives actions that are gaining fitness (whether their fitness is 
negative or positive) a positive push toward a higher chance of getting selected. 
On the other hand, actions that are losing fitness (again whether their fitness is 
negative or positive) are given a negative push toward a lower chance of getting 
selected. This extends the traditional replicator equations. The algorithm we 
used to calculate this dynamics can be found in algorithm 1. It will be referred 
to as the ERD algorithm since it extends the traditional RD(x) with the future. 
This extended evolutionary dynamics succeeds in converging to a stable Nash 
Equilibrium (NE) in all 3 categories of 2×2 games. Experiments confirming this 
can be found in section 4. 
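
To make the construction concrete, the following sketch shows one way the extended dynamics could be stepped numerically for a single player, following the three steps of algorithm 1 as they are printed. The function names `rd` and `erd_step` are illustrative assumptions, and the caller is assumed to keep the RD values of the two previous timesteps.

```python
def rd(p, q, A):
    """Replicator right-hand side [(Aq)_i - p.Aq] p_i for one player."""
    Aq = [sum(a * qj for a, qj in zip(row, q)) for row in A]
    avg = sum(pi * v for pi, v in zip(p, Aq))
    return [(v - avg) * pi for pi, v in zip(p, Aq)]

def erd_step(p, rd_now, rd_prev1, rd_prev2, eta, step_size):
    """One update of one player's strategy in the spirit of algorithm 1.

    rd_now, rd_prev1, rd_prev2: RD values at timesteps n, n-1 and n-2.
    """
    new_p = []
    for i, pi in enumerate(p):
        # Step 1: finite-difference approximation of the acceleration.
        accel = rd_prev1[i] - rd_prev2[i]
        # Step 2: payoff positivity, i.e. the extended term may not flip
        # the sign of the plain replicator term.
        pushed = rd_now[i] + eta * accel
        same_sign = (rd_now[i] > 0 and pushed > 0) or (rd_now[i] < 0 and pushed < 0)
        if not same_sign:
            accel = 0.0
        # Step 3: adjust the strategy (update rule read literally from the listing).
        new_p.append(pi + step_size * rd_now[i] + eta * accel)
    return new_p
```

When the RD values of the two previous steps coincide, the acceleration vanishes and the update reduces to an ordinary Euler step of the replicator dynamics.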

In the next section the reinforcement learning algorithm based on these ex- 
tended dynamics is developed. 

3.2 Developing an Extended RL- Algorithm 

To develop a RL-algorithm based on the Extended Replicator Dynamics, we 
start from the result of Börgers and Sarin. They showed that in an appropriately 
constructed continuous time limit, the Cross learning model converges to the 
asymmetric, continuous time version of the replicator dynamics [1]. Recall that 
we extended these dynamics with the following acceleration term, 

$$\eta\, \frac{dRD(x)}{dt} \qquad (9)$$

expressing that we act greedily toward payoff in the future. So for the part of 
the RD we can rely on the Cross algorithm of Börgers and Sarin. For the part 
of equation (9) we calculate an approximation in algorithm 2. 

Algorithm 1 The algorithm used to calculate extended replicator dynamics. 
Parameters: 

  η            How far in the future we look for calculating future fitness growth. 
  stepSize     Determines how large steps we take in the update rule. 
  a_1..r       The actions a available to the players p. 
  p(n) = (p_1(n), ..., p_r(n))   The chances for p playing a_i. 
  RD_{a_i}(n)  The replicator dynamics function of action a_i from player p at timestep n 
               according to the current position of all players' strategies. This implicitly 
               defines the game being played. 

For all actions a_i: 

1. Calculate an approximation to the replicator dynamics acceleration in the current 
   position of all players' strategies in A. 

   acceleration_{a_i}(n) := RD_{a_i}(n - 1) - RD_{a_i}(n - 2); 

2. Ensure payoff positivity. It turns our dynamics into one that is stable in a NE. 

   if (¬(((RD_{a_i}(n) > 0) ∧ (RD_{a_i}(n) + η * acceleration_{a_i}(n) > 0)) ∨ 
         ((RD_{a_i}(n) < 0) ∧ (RD_{a_i}(n) + η * acceleration_{a_i}(n) < 0)))) 
   { acceleration_{a_i}(n) := 0; } 

3. Adjust the strategy according to: 

   f_{a_i}(n) := stepSize * RD_{a_i}(n) + η * acceleration_{a_i}(n);   (f is the function we are approximating) 
   p_i(n) := p_i(n) + f_{a_i}(n) 

Step a of algorithm 2 is nothing more than the calculation of Cross Learning. 
Step b calculates the approximation of the ERD, where the accel variable 
contains the approximation of equation (9). Furthermore payoff positivity is 
ensured. Payoff positivity means that strategies that earn above (below) average 
have positive (negative) growth rates. Actually, with payoff positivity we ensure 
that there is stability in all Nash equilibria. Step c executes the update of the 
probabilities of the different actions and updates the acceleration term (or the 
payoff in the future). To finalize, the probabilities are normalized. 

In the next section we will show experiments in all classes of games with 
this algorithm and it will become clear that it always converges to the Extended 
Replicator Dynamics. 
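
The per-player part of algorithm 2 (steps a to c plus the final normalization) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, rewards are assumed to be scaled into [0, 1], the acceleration is multiplied by η in step (b) by analogy with algorithm 1 (the printed listing does not show η there), and probabilities are clipped at zero before normalizing as an added safeguard.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def extended_cross_step(p, accel, former_speed, chosen, reward,
                        eta, step_size, theta):
    """One learning step for one player; returns updated (p, accel, former_speed).

    p, accel, former_speed: lists with one entry per action.
    chosen: index of the action the player just played.
    reward: immediate payoff, assumed to be scaled into [0, 1].
    """
    new_p, new_accel, new_speed = [], [], []
    for i, pi in enumerate(p):
        # (a) Cross learning increment.
        cross = reward * (1 - pi) if i == chosen else -reward * pi
        # (b) Extended RD approximation, kept payoff positive.
        # ASSUMPTION: accel is scaled by eta, by analogy with algorithm 1.
        pushed = cross + eta * accel[i]
        extended = pushed if sign(cross) == sign(pushed) else 0.0
        # (c) Update rule; clipping at 0 is an added safeguard, not in the listing.
        new_p.append(max(pi + step_size * extended, 0.0))
        speed = step_size * cross
        new_accel.append(accel[i] + theta * ((speed - former_speed[i]) - accel[i]))
        new_speed.append(speed)
    total = sum(new_p)  # final step: normalize so the probabilities sum to 1
    return [v / total for v in new_p], new_accel, new_speed
```

A driver loop would sample one action per player from the current probabilities, obtain the rewards from the game, and then call this function once per player.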

4 Experiments 

In this section we describe some experiments that illustrate the mathematical 
derivation of section 3. The experiments have been conducted with 2x2 games. 
The general payoff matrices, A for the first player and B for the second, are de- 
fined in table 1. The family of 2 x 2 games is usually classified in three subclasses, 
as follows [5], 

Subclass 1: if $(a_{11} - a_{21})(a_{12} - a_{22}) > 0$ or $(b_{11} - b_{12})(b_{21} - b_{22}) > 0$, at least 
one of the 2 players has a dominant strategy, therefore there is just 1 strict 
equilibrium. 



Extended Replicator Dynamics as a Key to Reinforcement Learning 427 



Algorithm 2 An algorithm for RL based on extended replicator dynamics. 

Parameters: 

  η                  How far in the future we look for calculating future fitness growth. 
  stepSize           Determines how large steps we take in the update rule. 
  a_1..r             The actions a available to the players p. 
  p(n) = (p_1(n), ..., p_r(n))   The chances for p playing a_i. 
  accel_{a_i}        A variable containing an approximative current acceleration for each p_i(n). 
  θ                  Learning rate for learning accel. 
  formerSpeed_{a_i}  A variable containing an approximative speed for each a_i. 
  Env(a_i)_p         The function returning the immediate payoff from the environment to 
                     player p when all players have acted according to a_i. This implicitly 
                     defines the game being played. 
  Act_p              The function defining the action selection method. 

For all players p 

  1. actions_p := Act_p; 

For all players p 

  1. rewards_p := Env(actions)_p 

For all players p 

  1. For all actions a_i 

     (a) Calculate CrossLearning. 
         if (a_i == actions_p) 
         then { CrossLearning := rewards_p * (1 - p_i(n)); } 
         else { CrossLearning := -rewards_p * p_i(n); } 

     (b) Calculate extended RD approximation, making sure we retain payoff positivity. 
         if (sign(CrossLearning) == sign(CrossLearning + accel_{a_i})) 
         then ExtendRDLearning := CrossLearning + accel_{a_i}; 
         else ExtendRDLearning := 0; 

     (c) Perform the update rule and calculation of accel_{a_i}. 
         p_i(n) := p_i(n) + stepSize * ExtendRDLearning; 
         accel_{a_i} := accel_{a_i} + θ * ((stepSize * CrossLearning - formerSpeed_{a_i}) - accel_{a_i}); 
         formerSpeed_{a_i} := stepSize * CrossLearning; 

  2. Normalize the p_i(n) so that their sum is 1. 



Subclass 2: if $(a_{11} - a_{21})(a_{12} - a_{22}) < 0$, $(b_{11} - b_{12})(b_{21} - b_{22}) < 0$, and 
$(a_{11} - a_{21})(b_{11} - b_{12}) > 0$, there are 2 pure equilibria and 1 mixed equilibrium. 
Subclass 3: if $(a_{11} - a_{21})(a_{12} - a_{22}) < 0$, $(b_{11} - b_{12})(b_{21} - b_{22}) < 0$, and 
$(a_{11} - a_{21})(b_{11} - b_{12}) < 0$, there is just 1 mixed equilibrium. 
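
The three subclass conditions translate directly into a small classifier. The sketch below is illustrative: the function name `classify_2x2` is an assumption, indices are 0-based, and degenerate games in which one of the products is exactly zero are reported as `None` because the conditions above do not cover them.

```python
def classify_2x2(A, B):
    """Return the subclass (1, 2 or 3) of a 2x2 game, or None if degenerate.

    A[i][j] and B[i][j] are the payoffs of the row and the column player when
    the row player picks i and the column player picks j (0-based indices).
    """
    row_dom = (A[0][0] - A[1][0]) * (A[0][1] - A[1][1])
    col_dom = (B[0][0] - B[0][1]) * (B[1][0] - B[1][1])
    if row_dom > 0 or col_dom > 0:
        return 1  # at least one player has a dominant strategy
    if row_dom < 0 and col_dom < 0:
        coord = (A[0][0] - A[1][0]) * (B[0][0] - B[0][1])
        if coord > 0:
            return 2  # two pure equilibria and one mixed equilibrium
        if coord < 0:
            return 3  # a single mixed equilibrium
    return None
```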

The first subclass includes those types of games where each player has a dominant 
strategy, as for instance the prisoner's dilemma. However it includes a larger 
collection of games since only 1 of the players needs to have a dominant strategy. 

Table 1. The left matrix (A) defines the payoff for the row player, the right matrix 
(B) defines the payoff for the column player. 

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$



Fig. 1. Left: The direction field of the ERD of the prisoners game. Right: The paths 
induced by the learning process 



In the second subclass none of the players has a dominated strategy. But both 
players receive the highest payoff by both playing their first or second strategy. 
This is expressed in the condition $(a_{11} - a_{21})(b_{11} - b_{12}) > 0$. The third subclass 
only differs from the second in the fact that the players do not receive their 
highest payoff by both playing the first or the second strategy. This is expressed 
by the condition $(a_{11} - a_{21})(b_{11} - b_{12}) < 0$. In the following three subsections 
we describe the results of the experiments conducted in each subclass. In all 
subclasses we used the following general settings, 

— η is set to 300 

— stepSize is set to 0.003 

— θ is set to 0.0006 



4.1 Category 1: Prisoners Dilemma 



In category 1 we considered the prisoner's dilemma game [13,3]. In this game both 
players have a dominant strategy, more precisely defect. The payoff matrices for 
this game are as follows. 

$$A = \begin{pmatrix} 1 & 5 \\ 0 & 3 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 \\ 5 & 3 \end{pmatrix}$$

In figure 1 the replicator dynamic of the game is plotted using the differential 
equations of (6). 

More specifically, the figure on the left illustrates the direction field of the 
extended replicator dynamics and the figure on the right shows the learning 
process of algorithm 1. We plotted for both players the probability of choosing 
their first strategy (in this case defect). So, the first player's probabilities are 
on the x-axis and the second player's probabilities on the y-axis. As starting 
points for the learning process we generated 50 random points. In every point a 
learning path starts and converges to the equilibrium at the point (1, 1). As you 
can see all the sample paths of the reinforcement learning process approximate 
the paths of the RD. 

Fig. 2. Left: The direction field of the RD of the battle of the sexes game. Right: The 
paths induced by the learning process 



4.2 Category 2: Battle of the Sexes 

For the second game we considered the battle of the sexes game, defined by the 
following payoff matrices [13,3]: 



$$A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$$



and 




Figure 2 demonstrates the results. On the left you see the direction field of this 
game, on the right the sample paths induced by the learning process. You can see 
3 equilibria: two pure equilibria at (0, 0) and at (1, 1), and one mixed at (2/3, 1/3). 
Now we have convergence to the 2 strict equilibria. The third equilibrium is very 
unstable as you can see in the direction field plot. This instability is the reason 
why it will not emerge from the learning process in the long run. Again we used 
a grid of random starting points. 



4.3 Category 3: 



The third class consists of the games with a unique mixed equilibrium. We 
considered the following game. 



$$A = \begin{pmatrix} 2 & 3 \\ 4 & 1 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}$$

Fig. 3. The direction field plot of the RD for subclass 3 





Fig. 4. Left: The direction field of the RD. Right: The paths induced by the learning 
process 



Typical for the traditional RD in this class of games is that the interior 
trajectories define closed orbits around the equilibrium point. Figure 3 illustrates 
this. 

This type of game reveals an important difference between the traditional RD 
and our ERD with its matching learning algorithm. ERD and the matching 
learning algorithm will not circle but converge to the mixed Nash equilibrium. 
This is illustrated in figure 4. Moreover the equilibrium is stable, meaning that 
the learning process will not abandon it. The long-run learning dynamics are 
illustrated in the figure on the right. Again we used a grid of random starting 
points for the learning process. 

5 Conclusion 

In this paper it is shown that the RD from EGT are an adequate basis for 
reinforcement learning in games. This opens a new perspective on developing 
reinforcement learning algorithms for multi-state games and Multi-Agent Systems. 
More precisely we showed that with an extension of the traditional replicator 
equations (ERD), Nash equilibria can be attained in all kinds of games. Based 
on this new dynamics we constructed a RL-algorithm that converges to this 
extended replicator dynamics. In [12] we showed that for one-state games Cross 
learning is the simplest learning model (compared with Learning Automata and 
Q-learning) and suffices to attain the same results as the other learning models. 
It turned out that the Cross model keeps things simplest in terms of parameter 
setting and computational effort. The experiments confirmed that 
with the Cross model, the Nash equilibria can be reached in the most elegant 
way. Therefore this new algorithm, extending Cross Learning and guaranteeing 
a stable Nash equilibrium, is sufficient for any type of one-state game. 

In a next phase, these results will be extended to multiple state games. De- 
veloping such algorithms will be based on Learning Automata and Q-learning, 
two possible techniques for multi-state games. In both techniques the connection 
with the Replicator Dynamics has been proved [10,11]. 

Abstract. Bayesian inference often requires approximating the posterior 
distribution with Markov Chain Monte Carlo (MCMC) sampling. A 
central problem with MCMC is how to detect whether the simulation has 
converged. The samples come from the true posterior distribution only 
after convergence. A common solution is to start several simulations from 
different starting points, and measure overlap of the different chains. We 
point out that Linear Discriminant Analysis (LDA) minimizes the overlap 
measured by the usual multivariate overlap measure. Hence, LDA 
is a justified method for visualizing convergence. However, LDA makes 
restrictive assumptions about the distributions of the chains and their 
relationships. These restrictions can be relaxed by a recently introduced 
extension. 



1 Introduction 

Probabilistic generative modeling is one of the theoretical foundations of current 
mainstream machine learning and data analysis. Bayesian inference makes very 
accurate but computationally intensive predictions possible, and gives rigorous 
methods for model selection and complexity control. In a nutshell, the uncer- 
tainty in the data is converted into uncertainty of the model parameters in the 
form of a distribution. Inference of parameter values and of predictions is then 
done based on this distribution. 

Bayesian inference is potentially very powerful but closed-form solutions are 
seldom available. Inference has to be based on either sophisticated approximation 
methods or simulations with Markov Chain Monte Carlo (MCMC) [1] sampling. 
MCMC sampling is a very versatile yet computationally intensive procedure. 
The main practical problem of MCMC is how to assess whether the simulation 
has converged. The resulting samples come from the true distribution only after 
convergence. 

There are several strategies for monitoring convergence [2]. Often in practice 
convergence is assessed by starting the simulation from several different initial 
conditions, and by monitoring when the different simulation chains become 
sufficiently mixed together. The mixing can be monitored visually on scatter plots 
of the MCMC samples against all pairs of variables, which is of course feasible 
only for models with few parameters. An alternative is to measure convergence 
quantitatively; measures of the overlap of the different sampling chains have 
been proposed by Brooks and Gelman [3]. The measures have the problem that 
rules of thumb are required for deciding whether the simulation has converged 
or not, and hence they are often complemented with visualizations. The other 
advantage of visualizations is that they are also useful for analyzing the reasons 
for convergence problems. 

It turns out that the main multivariate convergence measure equals the cost 
function of a one-dimensional LDA (for a definition of LDA see [4]), a method 
that discriminates between data classes. Here the classes are the different sam- 
pling chains. Our first main result or suggestion is to use LDA for visual eval- 
uation of convergence. It has a rigorous criterion for visualizing convergence 
and complements the existing quantitative measures. Our second main result is 
an extension of the LDA visualization by applying a less restrictive measure of 
the overlap of the chains, resulting in a connection with a recent extension of 
discriminant analysis. 



2 Bayesian Modeling in a Nutshell 

In Bayesian modeling the relationship between the data y and the parameters 
$\theta$ of the model is defined by the likelihood $p(y|\theta)$. Knowledge about the 
parameter values before observing the data is given by the prior distribution $p(\theta)$. 
By combining these we get the posterior distribution that represents our 
knowledge about the parameter values after observing the data. The posterior can be 
calculated from the prior and likelihood with the Bayes formula 

$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{\int p(y|\theta)\,p(\theta)\,d\theta}, \qquad (1)$$

where $\int p(y|\theta)\,p(\theta)\,d\theta$ is a normalizing term. 

While in maximum likelihood estimation a single parameter value is sought, 
in Bayesian data analysis the result is the whole posterior distribution. This 
makes it possible to take our uncertainty about the parameter values into account 
in inference. A Bayesian model can be used to predict new values $\tilde{y}$ according 
to the posterior predictive distribution 

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta)\,p(\theta|y)\,d\theta, \qquad (2)$$

where the uncertainty of the parameter values has been taken into account by 
integrating over the posterior distribution. 
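
For a one-parameter model, equations (1) and (2) can be evaluated by brute force on a grid, which makes the role of the normalizing term explicit. The sketch below uses a hypothetical coin-flip model (Bernoulli likelihood, uniform prior); the model, the function names and the grid size are illustrative assumptions, not part of the paper.

```python
def grid_posterior(heads, tails, grid_size=1000):
    """Posterior density of a coin's heads-probability theta on a midpoint grid.

    Uses a uniform prior p(theta) = 1 and a Bernoulli likelihood; the integral
    in the denominator of equation (1) becomes a sum over grid points.
    """
    width = 1.0 / grid_size
    thetas = [(k + 0.5) * width for k in range(grid_size)]
    unnorm = [t ** heads * (1 - t) ** tails for t in thetas]  # p(y|theta) p(theta)
    evidence = sum(u * width for u in unnorm)                 # normalizing term
    density = [u / evidence for u in unnorm]
    return thetas, density, width

def predictive_heads(heads, tails, grid_size=1000):
    """Posterior predictive probability of heads, equation (2), as a grid sum."""
    thetas, density, width = grid_posterior(heads, tails, grid_size)
    return sum(t * d * width for t, d in zip(thetas, density))
```

Such a grid is only practical for very few parameters, which is why sampling-based approximations are used for realistic models.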

In practice the posterior distribution is usually not known in closed form 
and has to be approximated. A common method for approximation is MCMC 
sampling. MCMC generates samples $x_t$ that are distributed according to 
the posterior distribution. These samples can be used to estimate any statistic 
of the distribution and integrals over the posterior get approximated with sums 
over samples. 

3 Monitoring Convergence Using Multiple Sequences 

3.1 Measuring Convergence 



One of the most common methods for monitoring MCMC convergence is the 
potential scale reduction factor (PSRF) proposed by Gelman and Rubin [5]. 
Multiple MCMC sequences are started from different (overdispersed) initial points 
and compared. At convergence the chains should come from the same distribution, 
which is assessed by comparing the variance and mean of each chain to the 
variance and mean of the combined chain. 

The PSRF is defined for one-dimensional data as follows. A number (m) of 
parallel chains are started, with 2n samples each. Only the last n potentially 
better converged samples from each chain are used. The between-chain variance 
B/n and pooled within-chain variance W are defined by 

$$\frac{B}{n} = \frac{1}{m-1} \sum_{j=1}^{m} (\bar{x}_{j\cdot} - \bar{x}_{\cdot\cdot})^2 \qquad (3)$$

and 

$$W = \frac{1}{m(n-1)} \sum_{j=1}^{m} \sum_{t=1}^{n} (x_{jt} - \bar{x}_{j\cdot})^2, \qquad (4)$$

where $\bar{x}_{j\cdot}$ is the mean of the samples in chain j and $\bar{x}_{\cdot\cdot}$ is the mean of the 
combined chains. 

By taking the sampling variability of the combined mean into account we get 
a pooled estimate for the posterior variance 

$$\hat{V} = \frac{n-1}{n}\, W + \left(1 + \frac{1}{m}\right) \frac{B}{n}. \qquad (5)$$

Finally an estimate $\hat{R}$ of PSRF is obtained by dividing the pooled posterior 
variance estimate by the pooled within-chain variance, 

$$\hat{R} = \frac{\hat{V}}{W}. \qquad (6)$$



If the chains have converged, the PSRF is close to one, which makes it a useful 
indicator of convergence. It is not a perfect indicator, however, since it does not 
guarantee convergence. The chains might not have traveled the whole state space 
yet and might discover possible new areas of high probability. Additionally, it 
does not take higher-order moments into account, only the mean and variance, 
and it is applicable to only one variable at a time. 
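
The univariate computation in equations (3) to (6) fits in a few lines. The following is a minimal sketch, assuming the caller has already discarded the first half of each run and that all chains have the same length; the function name `psrf` is illustrative.

```python
def psrf(chains):
    """Estimate the potential scale reduction factor from m chains of n samples.

    chains: list of m equally long lists of scalar samples (the retained
    second halves of the runs).
    """
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Equation (3): between-chain variance B/n.
    b_over_n = sum((mu - grand) ** 2 for mu in means) / (m - 1)
    # Equation (4): pooled within-chain variance W.
    w = sum((x - mu) ** 2 for c, mu in zip(chains, means) for x in c) / (m * (n - 1))
    # Equation (5): pooled estimate of the posterior variance.
    v = (n - 1) / n * w + (1 + 1 / m) * b_over_n
    # Equation (6): ratio of pooled to within-chain variance.
    return v / w
```

Identical chains give a value slightly below one, (n - 1)/n, while chains centred at different locations give values well above one.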

Brooks and Gelman [3] have extended the PSRF to a multivariate version, 
MPSRF. It is defined, similarly to the univariate PSRF, in terms of the estimate 
of the posterior covariance matrix $\hat{V}$, which we get from (5) by replacing the 
scalar variances B/n and W with the covariance matrices 

$$\frac{B}{n} = \frac{1}{m-1} \sum_{j=1}^{m} (\bar{x}_{j\cdot} - \bar{x}_{\cdot\cdot})(\bar{x}_{j\cdot} - \bar{x}_{\cdot\cdot})^T \qquad (7)$$

$$W = \frac{1}{m(n-1)} \sum_{j=1}^{m} \sum_{t=1}^{n} (x_{jt} - \bar{x}_{j\cdot})(x_{jt} - \bar{x}_{j\cdot})^T. \qquad (8)$$



In the multivariate case the comparison of within-chain variance to the pooled 
variance requires comparing the matrices. Brooks and Gelman chose to summarize 
the comparison by a maximum root statistic which gives the maximum 
scale reduction factor of any linear projection of x. The estimate $\hat{R}^p$ of MPSRF 
is defined by 

$$\hat{R}^p = \max_a \frac{a^T \hat{V} a}{a^T W a} \qquad (9)$$

$$\phantom{\hat{R}^p} = \frac{n-1}{n} + \frac{m+1}{m} \max_a \frac{a^T (B/n)\, a}{a^T W a} \qquad (10)$$

$$\phantom{\hat{R}^p} = \frac{n-1}{n} + \frac{m+1}{m}\, \lambda_1, \qquad (11)$$

where $\lambda_1$ is the largest eigenvalue of the matrix $W^{-1}B/n$. 

This criterion is very closely related to linear discriminant analysis (LDA). 
The goal of (a one-dimensional) LDA is to find the linear transformation $y = a^T x$ 
that maximizes the variance between classes, relative to the variance within 
classes. More formally, LDA solves the problem 

$$\max_a \frac{a^T B_{ss}\, a}{a^T W_{ss}\, a}, \qquad (12)$$

where $B_{ss}$ and $W_{ss}$ are the between and within sum of squares and cross 
products (SSCP) matrices which differ only by a constant scale from the 
corresponding covariance matrices. This is a generalized eigenvalue problem, and its solution 
a is the eigenvector corresponding to the largest eigenvalue of $W_{ss}^{-1}B_{ss}$. 

Hence, disregarding the constants, MPSRF equals the cost function of (a 
one-dimensional) LDA. In other words, optimizing the LDA is equivalent to 
choosing the component that best detects convergence, in the sense of MPSRF. 
Monitoring convergence by MPSRF or by the LDA cost function is equivalent; 
if the chains can be discriminated, then they have not converged. 
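
The equivalence can be made tangible with a short NumPy sketch that computes the MPSRF estimate of equation (11) together with the corresponding discriminant direction. The function name `mpsrf_and_direction` and the array layout are assumptions made for illustration; the sketch solves the eigenproblem of $W^{-1}B/n$ directly rather than calling a dedicated LDA routine.

```python
import numpy as np

def mpsrf_and_direction(chains):
    """Return the MPSRF estimate and the leading discriminant direction.

    chains: array-like of shape (m, n, d) holding m chains of n samples in
    d dimensions (the retained halves of the runs).
    """
    chains = np.asarray(chains, dtype=float)
    m, n, d = chains.shape
    means = chains.mean(axis=1)                  # per-chain means, shape (m, d)
    dev = means - means.mean(axis=0)
    b_over_n = dev.T @ dev / (m - 1)             # equation (7)
    centered = chains - means[:, None, :]
    w = np.einsum('jti,jtk->ik', centered, centered) / (m * (n - 1))  # equation (8)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(w, b_over_n))    # W^{-1} B/n
    top = int(np.argmax(eigvals.real))
    lam = eigvals.real[top]
    direction = eigvecs[:, top].real
    direction = direction / np.linalg.norm(direction)
    return (n - 1) / n + (m + 1) / m * lam, direction   # equation (11)
```

Projecting all samples onto the returned direction gives the one-dimensional view in which the chains are most separated in the sense of equation (12).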



3.2 Visualizing Convergence 

Current practice. It is common practice to complement the convergence 
measures by visualizations of the MCMC chains. Visualizations are useful especially 
when analyzing reasons for convergence problems. Convergence measures can only 
tell that the simulations did not converge, not why they did not. 

MCMC chains have traditionally been visualized in three ways. Each variable 
in the chain can be plotted as a separate time series, or alternatively the marginal 




436 



Jarkko Venna, Samuel Kaski, and Jaakko Peltonen 



distributions can be visualized as histograms. The third option is a scatter or 
contour plot of two parameters at a time, possibly showing the trajectory of 
the chain on the projection. The obvious problem with these visualizations is 
that they do not scale up to large models with lots of parameters. The number 
of displays would be large, and it would be hard to grasp the underlying high- 
dimensional relationships of the chains based on the component-wise displays. 

Some new methods have been suggested. For three dimensional distributions 
advanced computer graphics methods can be used to visualize the shape of the 
distribution [6]. Alternatively, if the outputs of the models can be visualized 
in an intuitive way, the chain can be visualized by animating the outputs of 
models corresponding to successive MCMC samples [7]. These visualizations 
are, however, applicable only to special models. 

A principled way of visualizing convergence. The worst problem with the straight- 
forward visualization methods is that they lack the means to focus on visualizing 
variables or dimensions that are relevant for convergence. This worsens the prob- 
lems caused by the required large number of plots. 

In the previous Section (3.1) it was noted that the MPSRF measure of MCMC 
convergence (10) is closely related to linear discriminant analysis (LDA). We will 
use this connection to justify the use of LDA to visualize the convergence of the 
MCMC sampler. 

In summary, LDA finds a projection that best separates the classes in the 
sense of maximizing the between-class variation relative to within-class variation. 
For a one-dimensional projection this was shown to be equivalent to choosing 
MPSRF as the criterion for the projection. 

There is no reason to confine the visualization to be one-dimensional. LDA 
chooses the second direction or projection axis to be the eigenvector corresponding 
to the second largest eigenvalue, etc. A K-dimensional LDA then maximizes 
$\sum_{k=1}^{K} \lambda_k$, the relative between-chain variance representable by the K directions 
together. This criterion could actually be used as an alternative convergence 
criterion to MPSRF; it takes directly into account deviation in several directions 
instead of only the dominant one. 

When LDA is used to visualize MCMC convergence we in effect try to find 
a linear transformation that visualizes the convergence problems as clearly as 
possible, in the sense of the (extended) MPSRF measure. 

3.3 Informative Components 

Brooks and Gelman [3] noted that any statistic calculated from the separate 
chains should be equal to the one calculated from the combined chain when the 
chains have reached convergence, as the distributions should then be the same. 
The LDA connection above resulted from comparing means and variances. We 
propose that instead of comparing a statistic, a more general measure would 
result from comparing the distributions themselves. A natural measure is the 
mutual information between the distributions and the chain index. The difference 
between this and the LDA (MPSRF) criterion is discussed below. 




Visualizations for Assessing Convergence and Mixing of MCMC 437 



Problems with LDA. LDA assumes that each class is normally distributed with 
the same covariance matrix in each class. If the assumptions are correct, LDA 
discriminates between two classes optimally. This does not hold in general, how- 
ever, in particular not before MCMC convergence for small data. 

Another problem surfaces when generalizing LDA to several classes. The 
objective considers only pairwise divergences between classes, and no longer 
corresponds to optimal discrimination. See the Appendix for details. 

To address the above problems, we suggest complementing LDA-based 
analysis with a generalization of LDA. The projection is linear but the assumptions 
about the distribution of data are relaxed. 

Relevant component analysis. A recent method for finding informative or 
relevant components directly maximizes their class-prediction power [8]. Formally, 
the conditional (log) likelihood 

$$L = \sum_{(x,c)} \log p(c \mid W^T x) \qquad (13)$$

of classes is maximized within the subspace formed by the components. Here 
x is the sample, c is its class, and W is the (orthogonal) projection matrix 
whose columns are the component directions. The optimal projection is specific 
to the number of components sought. The well-defined objective for finite data, 
the likelihood, is asymptotically equivalent to the mutual information between 
components and classes. The task of finding such components was coined relevant 
component analysis (RCA). A sketch of the connection between LDA and RCA 
is presented in the Appendix. 

In this paper the c are the different chains, and RCA maximizes the (log) 
likelihood of correctly guessing which MCMC chain each sample is from. For 
converged chains one cannot (asymptotically) do better than a random guess; 
hence, large likelihood indicates non-convergence which can be assessed visually 
from the RCA projection. 

With finite data, we do not know the exact densities $p(c \mid W^T x)$, but we can 
optimize the projection parameters by using a nonparametric estimate $\hat{p}(c \mid W^T x)$ 
in the projection space. Since this estimate is non-parametric, RCA makes no 
distributional assumptions. For details on RCA and its optimization, see [8]. 
Technically, we replaced the stochastic gradient in [8] by conjugate (batch) 
gradient optimization. 

The main justification for using RCA here is that it maximizes a flexible 
measure of separation of the classes. It remains an empirical question how 
much RCA improves the LDA-based visualizations. In Section 4.2 we apply 
both methods to assess convergence in a relatively simple task. 

4 Analysis of a MCMC Run 

To demonstrate visual analysis of a MCMC sampler we have chosen a data 
set that contains reaction times for schizophrenics and nonschizophrenics. The 
model and the problem are described in the book Bayesian Data Analysis [9] 
(Example 16.4, p.426) and were also used to illustrate the use of the PSRF 
measure in the original article [5]. 

The data consist of (log) reaction time measurements from 11 nonschizophren- 
ics and 6 schizophrenics. Each person had their reaction time measured 30 times. 
It is believed that schizophrenics suffer from attentional deficit on some mea- 
surements as well as an overall motor reflex retardation. 

For the nonschizophrenics the reaction time is modeled as a random-effects 
model with a distinct mean $\alpha_j$ for each person and a common variance $\sigma_y^2$. 
The reaction times for the schizophrenics are modeled with a two-component 
Gaussian mixture. With probability $(1 - \lambda)$ there is no attention lapse and the 
response time has mean $\alpha_j$ and variance $\sigma_y^2$. With probability $\lambda$ there is a delay 
and the response time has mean $\alpha_j + \tau$ and the same variance $\sigma_y^2$. To address the 
question about the amount of motor reflex retardation, a hierarchical population 
model is devised. The means of the reaction times $\alpha_j$ are modeled to be normally 
distributed with a mean $\mu$ for the nonschizophrenics and a mean $\mu + \beta$ for the 
schizophrenics. The model can be expressed as 

$y_{ij} \mid \alpha_j, \zeta_{ij}, \phi \sim N(\alpha_j + \tau \zeta_{ij}, \sigma_y^2)$, (14) 

$\alpha_j \mid \zeta_{ij}, \phi \sim N(\mu + \beta S_j, \sigma_\alpha^2)$, (15) 

$\zeta_{ij} \mid \phi \sim \mathrm{Bernoulli}(\lambda S_j)$, (16) 

where $\phi = (\sigma_\alpha^2, \beta, \lambda, \tau, \mu, \sigma_y^2)$ contains the hyperparameters and $y_{ij}$ is the 
response $i$ from person $j$. The term $S_j$ is an indicator that equals 1 for schizophrenics 
and 0 for nonschizophrenics, and $\zeta_{ij}$ is an unobserved indicator that equals 
1 if the observation arose from a delayed response and 0 otherwise. 
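
The generative structure of the model above can be illustrated with a forward simulation. The following is a minimal sketch and not the Gibbs sampler used in the study; the function name `simulate_reaction_times` and all hyperparameter values are hypothetical, chosen only so that the example runs.

```python
import random

def simulate_reaction_times(n_non=11, n_schiz=6, n_obs=30, seed=0,
                            mu=5.7, beta=0.3, tau=0.8, lam=0.1,
                            sigma_alpha=0.15, sigma_y=0.2):
    """Draw data from the hierarchical mixture model (14)-(16).

    All hyperparameter defaults are illustrative assumptions.
    Returns one (S_j, alpha_j, [(zeta_ij, y_ij), ...]) tuple per person.
    """
    rng = random.Random(seed)
    people = []
    for j in range(n_non + n_schiz):
        s_j = 1 if j >= n_non else 0                       # indicator S_j
        alpha_j = rng.gauss(mu + beta * s_j, sigma_alpha)  # Eq. (15)
        obs = []
        for _ in range(n_obs):
            zeta = 1 if rng.random() < lam * s_j else 0    # Eq. (16)
            y = rng.gauss(alpha_j + tau * zeta, sigma_y)   # Eq. (14)
            obs.append((zeta, y))
        people.append((s_j, alpha_j, obs))
    return people
```

Because the Bernoulli probability is multiplied by the indicator, the delay component is never active for nonschizophrenics in this sketch, which mirrors the role of the indicator in (16).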

The hyperparameters in $\phi$ are assigned a noninformative uniform prior density. 
Additionally, $\sigma_\alpha^2$, $\tau$, and $\sigma_y^2$ are restricted to be positive. The mixture 
parameter $\lambda$ is further restricted to the interval [0.001, 0.999]. As all necessary 
conditional distributions were readily available, Gibbs sampling, a form of MCMC, 
was used. 

Ten chains of 1500 samples each were generated from random starting po- 
sitions. The MPSRF measure showed that the sampling had not converged. 
Calculating the univariate PSRF measures for the 23 variables we were interested 
in (all except the indicator variables $\zeta_{ij}$) showed that several variables had 
not converged. At this point we still had no idea what had gone wrong with the 
sampler, or whether the convergence was just slow. 
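
For reference, a univariate potential scale reduction factor can be computed along the following lines. This is a minimal sketch of the basic Gelman-Rubin form (pooled variance estimate over within-chain variance); the exact PSRF variant used in the study may include further correction terms, and the function name `psrf` is an assumption of this example.

```python
from math import sqrt
from statistics import mean, variance

def psrf(chains):
    """Basic potential scale reduction factor for one scalar variable.

    `chains` is a list of m equal-length lists with n draws each.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    w = mean(variance(c) for c in chains)   # within-chain variance W
    b_over_n = variance(chain_means)        # between-chain variance B/n
    var_plus = (n - 1) / n * w + b_over_n   # pooled variance estimate
    return sqrt(var_plus / w)
```

Values clearly above 1 indicate that the chains occupy different regions, which is the situation reported above for several of the 23 variables.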

4.1 Visualization with LDA 

Gaining insight on the problem. In order to better understand the behavior of 
the chains we visualized a part of the simulation, samples [200, 600] around the 
point 350 after which the MPSRF measure seemed to have stabilized at a high 
value. 

It is clear from the LDA projection (Fig. 1) that there are five distinct clusters 
in the sample set. By color coding (not shown) the different chains with different 











Fig. 1. Two-dimensional LDA projection of all samples from the interval [200,600]. 
The ellipses have been drawn by hand to mark the chains. 



colors it was easy to identify the chains. Six of the chains were clustered together 
and the other four formed a separate cluster each. Three of the chains were 
separated from the main cluster on discriminative component 1 and one on 
component 2. We additionally checked whether any of the separate chains could 
still be moving toward the common cluster, by color coding based on sampling 
time. There was no visible hint of that. 

Verifying the findings. A further study showed that four of the chains had ended 
up in a degenerate part of the parameter space, that is, in a part where the 
mixture model has collapsed to a one-component model, already very soon after 
the initialization. For three of these chains (chains 1, 2, and 3) the probability 
of a sample being generated by a delayed mixture component was so low that 
no samples were assigned to it. This was apparent already by a quick look at 
the one-dimensional time series plots of these chains. The delay parameter $\tau$ had 
not changed at all from the starting position. 

The reason for the fourth chain appearing separated is the reverse. Nearly all 
samples came from the mixture component representing delayed measurements, 
and hence $\beta$ and $\tau$ could not be identified separately. It was harder to 
diagnose the problem with this chain because the time series plots looked normal. 
The LDA visualization in Figure 1 helped to quickly identify the problem areas. 

Checking the behavior of the sampler near convergence. At this point we could 
have modified our model or our sampler to remove the problems. If there are a 
sufficient number of chains, a rapid alternative is to discard the degenerate ones. 
We computed the MPSRF measure again for the remaining chains. It is clear 
from Figure 2a that this time convergence has been reached after about 350 
samples. For a demonstration we created a new LDA projection showing only 
the nondegenerate chains. In Figure 2b we can see that there are two ’tails’ from 
chains which are moving toward the common distribution. By color coding the 
Fig. 2. a) MPSRF measure calculated from the nondegenerate chains (5-10). b) LDA 
projection of the nondegenerate samples from the interval [200, 600]. The ellipses have 
been drawn by hand to mark early samples from chains 5 and 6. The samples can be 
visualized by a time-based color code. 




Fig. 3. a) 2D RCA projection of all samples from the interval [200,600]. b) Enlarged 
view of the box in lower right corner of a. 



samples based on time we verified that the samples were indeed early samples 
and that the two chains became combined with the other chains after the early 
samples. Thus we could conjecture that the simulation had converged this time. 



4.2 Visualization with RCA 

We finally compare qualitatively the less restrictive RCA projection with LDA 
to verify that it gives the same or better insights on convergence. 

From the two-dimensional RCA projection of all samples from the interval 
[200, 600] (Fig. 3) we can see that RCA has discovered the same five clusters 
as LDA. Four of the clusters are composed of a single chain each, and the last 
consists of six chains. In addition, RCA has found the two ’tails’ of samples, 
generated by two chains converging toward the multi-chain cluster. These are the 
same ’tails’ that were found using LDA on the nondegenerate chains (Fig. 2b). 

Chains 1 and 2 are far from the others in both the LDA and the RCA 
visualizations. However, the LDA visualization kept chain 4 far apart as well, 
whereas RCA placed it closer to chains 5-10 and instead separated the ’tails’ of 
early samples of chains 5 and 6. Since chain 4 can still be discriminated well, 
this yields a more informative projection. 

In conclusion, RCA visualization displayed all the discovered convergence 
properties in a single two-dimensional visualization. No additional studies were 
required as with LDA. (A visualization corresponding to Figure 2b was computed 
just in case, and revealed only the same properties.) 



5 Discussion 

We have shown how to create visualizations for MCMC convergence analysis 
with linear discriminant analysis (LDA). Problems can be identified quickly us- 
ing only a few visualizations. Justification for LDA comes from its connection 
to a common convergence measure: Its goal is to separate the different simula- 
tion chains, and if it is successful the simulation has not converged. This was 
demonstrated in a case study. 

It is straightforward to extend the black-and-white visualizations of these 
proceedings with color coding. If the different chains are colored differently it is easy 
to distinguish them in the figures. Coloring samples with shades that change as 
a function of time makes the evolution of the chains during sampling visible. 
Further possibilities for extensions are coloring according to the likelihood of 
the sampled models, or coloring according to the prior or posterior density of 
the samples. This would clearly show how much the posterior differs from the 
prior, for example. 

If more details about the behavior of the sampler are of interest, some more 
technical measures like acceptance ratio or autocorrelation within a window 
around the sample could be visualized by the color code. This could possibly 
identify areas where the sampler is performing poorly. These ideas could be 
combined in an interactive visualization tool aimed at easy exploratory analysis 
of the behavior of a MCMC sampler. 

Even though LDA can be used for principled visualizations of MCMC chains, 
it is based on assumptions that often do not hold. It assumes normally distributed 
chains, which usually does not hold, and that the covariance matrices of the 
chains are the same, which holds only after convergence. A new method, RCA, 
is based on a more flexible measure of the overlap of the simulation chains: the 
likelihood of predicting the chains, which asymptotically becomes the mutual 
information. These theoretical connections justify the use of RCA, and it 
was demonstrated to work better than LDA in a small case study. 

Finally, the objective function of RCA could additionally serve as a measure 
of convergence, when compared with a naive estimate that simply predicts the 
overall chain proportions. If the values are different, MCMC has not converged. 
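
The comparison suggested above can be sketched as follows. This is only an illustration under stated assumptions: it works on one-dimensional projected samples, uses a leave-one-out Gaussian-kernel estimate as a stand-in for the nonparametric estimate in RCA, and the function names and the bandwidth value are hypothetical.

```python
from math import exp, log

def chain_prediction_loglik(points, labels, bandwidth=0.5):
    """Mean leave-one-out log-likelihood of the chain label under a
    Gaussian-kernel estimate of p(c | y) for 1-D projected samples y.
    Assumes every point has at least one neighbor within kernel range."""
    total = 0.0
    for i, (y_i, c_i) in enumerate(zip(points, labels)):
        same = everyone = 0.0
        for j, (y_j, c_j) in enumerate(zip(points, labels)):
            if i == j:
                continue
            k = exp(-0.5 * ((y_i - y_j) / bandwidth) ** 2)
            everyone += k
            if c_j == c_i:
                same += k
        total += log(max(same, 1e-300) / everyone)  # floor avoids log(0)
    return total / len(points)

def naive_loglik(labels):
    """Mean log-likelihood when predicting only the overall chain proportions."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return sum(log(counts[c] / n) for c in labels) / n
```

When the first value clearly exceeds the second, the chain identity can be predicted from the projected samples, which signals non-convergence; for well-mixed chains the kernel estimate cannot beat the naive one.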
Appendix: Connection between LDA and RCA 



Reformulating LDA. For simplicity, consider only the first LDA component $a$. 
Denote $\sigma_a^2 = a^T W_{ss} a / N$, where $N$ is the total number of samples. The LDA 
objective equals the variance of class centers along the projection direction, relative 
to the within-class variance: 

$\frac{a^T B_{ss} a}{a^T W_{ss} a} = \frac{1}{N \sigma_a^2} a^T B_{ss} a = \sum_c \frac{n_c}{N} \frac{(a^T(\bar{x}_{c\cdot} - \bar{x}_{\cdot\cdot}))^2}{\sigma_a^2}$ (17) 



Since, for a scalar variable $x$, $E_{x_1,x_2}[(x_1 - x_2)^2] = 2E[x^2] - 2(E[x])^2 = 2E[(x - E[x])^2]$, 
the objective further equals (up to a constant multiplier) the weighted 
sum of squared distances between class pairs: 

$\frac{2}{N \sigma_a^2} a^T B_{ss} a = \sum_{c_1, c_2} \frac{n_{c_1} n_{c_2}}{N^2} \frac{(a^T(\bar{x}_{c_1\cdot} - \bar{x}_{c_2\cdot}))^2}{\sigma_a^2}$ (18) 
Since $a^T a = 1$, each Gaussian class has a variance of $\sigma_a^2$ along the projection 
dimension. Then, for each pair of classes $c_1$ and $c_2$, the rightmost term equals 
the squared Mahalanobis distance of the projected class centers along the 
projection. This in turn equals the following symmetrized Kullback-Leibler divergence 
between the distributions along the projection [10]: 

$\frac{1}{\sigma_a^2}(a^T(\bar{x}_{c_1\cdot} - \bar{x}_{c_2\cdot}))^2 = D_{KL}(p(a^T x \mid c_1), p(a^T x \mid c_2)) + D_{KL}(p(a^T x \mid c_2), p(a^T x \mid c_1))$ (19) 

LDA thus maximizes a sum of symmetrized Kullback-Leibler divergences 
between the classes along the projection, weighted by the fractions $n_{c_1} n_{c_2} / N^2$. 
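
The two identities used in this reformulation can be checked numerically. The sketch below assumes one-dimensional projected samples per class; the function names are hypothetical, and the Kullback-Leibler divergence between two equal-variance Gaussians is approximated by a simple midpoint sum rather than taken from the closed form.

```python
from math import exp, pi, sqrt

def lda_objective_terms(classes):
    """`classes` maps a class label to its 1-D projected samples a^T x."""
    n_total = sum(len(v) for v in classes.values())
    means = {c: sum(v) / len(v) for c, v in classes.items()}
    grand = sum(sum(v) for v in classes.values()) / n_total
    # sigma_a^2: pooled within-class variance along the projection
    sigma2 = sum((y - means[c]) ** 2
                 for c, v in classes.items() for y in v) / n_total
    # weighted variance of class centers, relative to sigma_a^2
    between = sum(len(v) / n_total * (means[c] - grand) ** 2
                  for c, v in classes.items()) / sigma2
    # weighted sum of squared distances over ordered class pairs
    pairwise = sum(len(classes[c1]) * len(classes[c2]) / n_total ** 2
                   * (means[c1] - means[c2]) ** 2
                   for c1 in classes for c2 in classes) / sigma2
    return between, pairwise

def gaussian_kl(m1, m2, sigma2, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-sum estimate of D_KL(N(m1, sigma2) || N(m2, sigma2))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        log_p = -0.5 * (y - m1) ** 2 / sigma2   # normalizers cancel in the ratio
        log_q = -0.5 * (y - m2) ** 2 / sigma2
        p = exp(log_p) / sqrt(2 * pi * sigma2)
        total += p * (log_p - log_q) * h
    return total
```

On any input the pairwise sum should come out as twice the between-class term, and the symmetrized divergence between two projected classes should match the squared distance of their means divided by the common variance.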

Improving the cost function. Optimizing the above objective (18) does not 
result in optimal discrimination. We will improve it in two steps. First, for each 
class pair $(c_1, c_2)$, replace the symmetrization in (19) with the Jensen-Shannon 
divergence. This helps to reinterpret the objective in a form that can be easily 
generalized. For brevity, denote $y = a^T x$, denote the proportions of the class 
prior probabilities by $p_{c_1} = p(c_1) / (p(c_1) + p(c_2))$ and $p_{c_2} = p(c_2) / (p(c_1) + p(c_2))$, 
and set $q(y) = p_{c_1} p(y \mid c_1) + p_{c_2} p(y \mid c_2) = p(y \mid c_1 \vee c_2)$, where $c_1 \vee c_2$ refers to the 
distribution containing only classes $c_1$ and $c_2$. The Jensen-Shannon divergence is 

$D_{JS}(p(y, c_1), p(y, c_2)) = p_{c_1} D_{KL}(p(y \mid c_1), q(y)) + p_{c_2} D_{KL}(p(y \mid c_2), q(y))$ 

$= p_{c_1} \int p(y \mid c_1) \log \frac{p(y \mid c_1)}{q(y)} \, dy + p_{c_2} \int p(y \mid c_2) \log \frac{p(y \mid c_2)}{q(y)} \, dy$ 

$= \int \sum_{c \in \{c_1, c_2\}} p(y \mid c) \, p_c \log \frac{p(y \mid c)}{q(y)} \, dy = I(y, c \mid c_1 \vee c_2)$. (20) 

LDA then finds (roughly, due to the different symmetrization) the direction that 
maximizes the sum of pairwise mutual informations between classes, weighted by 
the class proportions. This suggests the natural extension to consider more than 
just pairwise class interactions, and maximize the complete mutual information 
$I(c, y)$ between classes and projected data. It can be shown that as the amount 
of data grows, the likelihood objective of RCA asymptotically equals I(c, y), up 
to a constant. RCA is then a finite-data implementation of an LDA extension. 
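
The equality between the weighted Jensen-Shannon divergence of a class pair and the mutual information between the projected data and the class can be verified on a small example. The sketch below assumes a discretized projection (probability vectors instead of densities); the function names and the numbers in the usage note are hypothetical.

```python
from math import log

def kl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p, q)."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p1, p2, w1, w2):
    """Weighted Jensen-Shannon divergence with class weights w1 + w2 = 1."""
    q = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    return w1 * kl(p1, q) + w2 * kl(p2, q)

def mutual_information(p1, p2, w1, w2):
    """I(y, c) for a two-class problem with p(c1) = w1 and p(c2) = w2."""
    q = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    total = 0.0
    for w, p in ((w1, p1), (w2, p2)):
        for py_c, qy in zip(p, q):
            joint = w * py_c                     # p(y, c)
            if joint > 0:
                total += joint * log(joint / (w * qy))
    return total
```

For example, with `p1 = [0.7, 0.2, 0.1]`, `p2 = [0.1, 0.3, 0.6]` and weights 0.4 and 0.6, both functions return the same value up to rounding.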
Abstract. We propose a method to improve the probability estimates 
made by Naive Bayes to avoid the effects of poor class conditional prob- 
abilities based on product distributions when each class spreads into 
multiple regions. Our approach is based on applying a clustering algo- 
rithm to each subset of examples that belong to the same class, and to 
consider each cluster as a class of its own. Experiments on 26 real-world 
datasets show a significant improvement in performance when the class 
decomposition process is applied, particularly when the mean number of 
clusters per class is large. 



1 Introduction 

Probabilistic classifiers constitute a major avenue of research in data mining, 
pattern recognition, and machine learning. Successful applications are found in 
speech recognition, document classification, and medical diagnosis, among many 
others. We focus on a popular probabilistic classifier based on the assumption of 
attribute independence, also known as Naive Bayes; the performance of this sim- 
ple classifier is unexpectedly often similar to other classifiers unrestrained by the 
attribute independence assumption. Although the reasons explaining the com- 
petitiveness of Naive Bayes remain unclear, several studies have revealed useful 
information; examples include studies about the conditions for its optimality [2] ; 
its geometric properties [15]; and how the product distribution implied by the 
independence assumption compares to most other joint distributions with the 
same set of marginals [5]. 

This paper reports on a method to improve the performance of Naive Bayes 
by attending to the distribution of examples in the input-output space. We work 
on the characterization and transformation of data rather than on the algorithm 
design. The idea is to transform the data by decomposing each class into clusters; 
this is useful to avoid the effects of poor class conditional probabilities based on 
product distributions when each class spreads into multiple regions. In contrast, 
most previous work looks at improving the algorithm design alone; examples in- 
clude adjusting the estimated probabilities [12], improving probability estimates 
[7, 14], and combining Naive Bayes with other models [8]. Some previous work 
does transform the data by searching for attribute dependencies to construct 
new features [4]; our approach differs in that the transformation is made over 
the class distribution by looking at each cluster as a new class. Our main idea is 
to augment the number of original classes according to the example distribution 
to improve the probability estimations made by Naive Bayes. 

Our experimental results, obtained using 26 datasets from the University of 
California at Irvine repository, enable us to provide an explanation for the com- 
petitiveness of Naive Bayes in real-world domains. In summary, most domains 
exhibit a distribution characterized by few clusters per class; a situation where 
Naive Bayes is known to perform well. In these cases the performance of our 
proposed approach is almost identical to Naive Bayes. But when a domain is 
characterized by many clusters per class (on average) the estimation of class- 
conditional probabilities is biased, and Naive Bayes performs poorly. In these 
cases our approach can improve the performance of Naive Bayes significantly. 
The fact that few domains exhibit many clusters per class explains why Naive 
Bayes often appears at the same level of performance as other (more sophisti- 
cated) algorithms. 

This paper is organized as follows. Section 2 introduces background informa- 
tion on classification, probabilistic classifiers, and Naive Bayes. Section 3 explains 
why Naive Bayes is expected to yield poor probability estimates under certain 
kinds of input-output distributions. Section 4 describes our class decomposition 
approach to improve and explain the performance of Naive Bayes. Section 5 com- 
pares our class-decomposition approach to local learning. Section 6 reports our 
experimental analysis. Finally, Section 7 gives a summary and discusses future 
work. 

2 Preliminaries 

Let $(A_1, A_2, \ldots, A_n)$ be an n-component vector-valued random variable, where 
each $A_i$ represents an attribute or feature; the space of all possible attribute 
vectors is called the input space $\mathcal{X}$. Let $\{y_1, y_2, \ldots, y_k\}$ be the possible classes, 
categories, or states of nature; the space of all possible classes is called the output 
space $\mathcal{Y}$. A classifier receives as input a set of training examples $T = \{(\mathbf{x}, y)\}$, 
where $\mathbf{x} = (a_1, a_2, \ldots, a_n)$ is a vector or point of the input space and $y$ is a point 
of the output space. We assume $T$ consists of independently and identically 
distributed (i.i.d.) examples obtained according to a fixed but unknown joint 
probability distribution $\phi$ in the input-output space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The outcome 
of the classifier is a function $h$ (or hypothesis) mapping the input space to the 
output space, $h : \mathcal{X} \to \mathcal{Y}$. Function $h$ can then be used to predict the class of 
previously unseen attribute vectors. 

We consider the case where a classifier defines a discriminant function 
$g_j(\mathbf{x})$ for each class, $j = 1, 2, \ldots, k$, and chooses the class corresponding to the 
discriminant function with highest value (ties are broken arbitrarily): 

$h(\mathbf{x}) = y_m$ iff $g_m(\mathbf{x}) > g_j(\mathbf{x})$ (1) 
In probabilistic classifiers the discriminant functions are the posterior 
probabilities of a class given the input vector $\mathbf{x}$, $P(y_j \mid \mathbf{x})$. Using Bayes rule¹: 

$g_j(\mathbf{x}) = P(y_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y_j) P(y_j)}{P(\mathbf{x})}$ (2) 



where $P(y_j)$ is the a priori probability of class $y_j$, $P(\mathbf{x} \mid y_j)$ is called the likelihood 
of $y_j$ with respect to $\mathbf{x}$ or the class-conditional probability, and $P(\mathbf{x})$ is the 
evidence factor [3]. Since the evidence factor $P(\mathbf{x})$ is constant for all classes we 
can dispense with it. Assuming all attributes are independent given the class 
yields the following discriminant function used by Naive Bayes: 

$g_j(\mathbf{x}) = P(y_j) \prod_{i=1}^{n} P(a_i \mid y_j)$ (3) 

where $a_i$ is the value of attribute $A_i$ in vector $\mathbf{x}$. The main idea is to approximate 
the joint input-output distribution through a product distribution by assuming 
attribute independence. While this is clearly unrealistic in many real-world ap- 
plications, experimental results have repeatedly demonstrated that Naive Bayes 
often performs as well as other algorithms that make no attribute independence 
assumption. Our goal in this paper is to relate the performance of Naive Bayes 
to the characteristics of a domain; the derived analysis shows a clear mechanism 
to improve the performance of this probabilistic classifier. 
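
The Naive Bayes discriminant function can be sketched in a few lines for discrete attributes. This is a minimal illustration that estimates all probabilities by relative frequencies without smoothing; the function names are assumptions of this example, not part of any library.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(y_j) and P(a_i | y_j) by relative frequencies.

    `examples` is a list of (attribute_tuple, class_label) pairs.
    """
    class_counts = Counter(y for _, y in examples)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, y in examples:
        for i, a in enumerate(x):
            value_counts[(y, i)][a] += 1
    n = len(examples)
    priors = {y: c / n for y, c in class_counts.items()}

    def likelihood(y, i, a):
        return value_counts[(y, i)][a] / class_counts[y]

    return priors, likelihood

def discriminant(x, y, priors, likelihood):
    """Prior times the product of per-attribute class-conditional probabilities."""
    g = priors[y]
    for i, a in enumerate(x):
        g *= likelihood(y, i, a)
    return g

def classify(x, priors, likelihood):
    """Return the class whose discriminant function is largest."""
    return max(priors, key=lambda y: discriminant(x, y, priors, likelihood))
```

In practice the product is usually computed as a sum of logarithms and the frequency estimates are smoothed; both refinements are omitted here to keep the correspondence with the discriminant function direct.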



3 A Perspective View of Naive Bayes 

Although the behavior of Naive Bayes has been explained from different 
perspectives [2, 15, 5], the degree of match between different target 
distributions and the set of assumptions or bias embedded by the algorithm 
remains unclear. In this section we identify a kind of distribution for which the 
product approximation of Naive Bayes may result in multiple misclassifications; 
we name this problem the class-dispersion problem. 



3.1 Maximum Entropy and Approximating Distributions 

We begin by studying the implication behind a product approximation. Our main 
assumption is that the set of training examples $T$ is drawn from an unknown 
but fixed probability distribution $\phi$ that defines $P(\mathbf{x}, y)$ for every point in the 
input-output space. Naive Bayes assumes distribution $\phi$ can be approximated 
through a product of low order components (i.e., product of marginals) assuming 
attribute independence given the class (equation 3). The following definitions will 
be instrumental in characterizing the approximation used by Naive Bayes. 



¹ We assume features take on discrete values; we then have probability masses, rather 
than probability densities. 
Definition 1. A distribution $\phi_{\max}$ over the input-output space $\mathcal{Z}$ is called a 
maximum entropy distribution if it assumes equal probabilities over all elements 
in $\mathcal{Z}$ (i.e., if it corresponds to a uniform distribution over all possible elements 
in $\mathcal{Z}$). The entropy of $\phi_{\max}$, denoted as $H_{\max}$, is as high as possible: 

$H_{\max} = -\sum_{i=1}^{|\mathcal{Z}|} \frac{1}{|\mathcal{Z}|} \log \frac{1}{|\mathcal{Z}|}$ (4) 

where $|\mathcal{Z}|$ is the size of the input-output space. 

Definition 2. The information contained in a probability distribution $\phi$ over 
the input-output space $\mathcal{Z}$, defined as $I_\phi$, is the difference between the entropy 
of the maximum entropy distribution and the actual entropy of $\phi$: 

$I_\phi = H_{\max} - H_\phi$ (5) 

where $H_\phi$ is defined as follows 

$H_\phi = -\sum_{i=1}^{|\mathcal{Z}|} P_i \log P_i$ (6) 

and each $P_i$ is the probability of element $i$ in $\mathcal{Z}$ according to $\phi$. 

An interpretation of Definition 2 is straightforward: a flat distribution where 
all elements are assigned equal probabilities carries no information, whereas the 
more peaked a distribution, the higher the information conveyed by such distri- 
bution [9]. 
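
The notion of information in Definition 2 is easy to compute for a finite space. The following minimal sketch takes a list of element probabilities; the function names are assumptions made for illustration.

```python
from math import log

def entropy(probs):
    """Shannon entropy of a discrete distribution (zero terms are skipped)."""
    return -sum(p * log(p) for p in probs if p > 0)

def information(probs):
    """Entropy of the uniform distribution on the same space minus the
    actual entropy, as in Definition 2."""
    h_max = log(len(probs))
    return h_max - entropy(probs)
```

A uniform distribution over four elements yields zero information, while a distribution that puts all mass on one of the four elements yields the maximum, log 4.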

We now consider the problem of approximating a true distribution $\phi$ using 
an approximation $\phi'$. Let us suppose all we know about $\phi$ is a set of low order 
component distributions $L$. All we require from approximation $\phi'$ is that it must 
reduce to the same set of low order components in $L$ (i.e., that it can be expressed 
as a function of the low order components in $L$). Approximating distributions can 
be categorized by the amount of information they contain. If the approximation 
is based on the idea of providing the least amount of additional information 
beyond the set of low order components, then we have a maximum entropy 
approximating distribution. 



Definition 3. Let $\phi$ be the true distribution over the input-output space $\mathcal{Z}$ and let $L$ 
be a set of low order components to which $\phi$ can be reduced. An approximating 
distribution of $\phi$ with respect to $L$ is called a maximum entropy approximating 
distribution, denoted as $\phi'_{L\max}$, if among all distributions $\phi'_L$ that reduce to the 
same set of low order components $L$, $\phi'_{L\max}$ is the one with maximum entropy 
or less information: 

$I_{\phi'_{L\max}} \leq I_{\phi'_L}$ (7) 

for all distributions $\phi'_L$ that reduce to the same set of low order components $L$. 
3.2 The Product Approximation of Naive Bayes 

We now return to the product approximation followed by Naive Bayes. It can 
be shown that a product approximation contains the smallest amount of 
information (or maximum entropy) of all possible approximations to $\phi$ that reduce 
to the same set of low order components. In other words, Naive Bayes is a 
maximum entropy approximating distribution [9]. We formalize this as follows: Let 
$L$ be the set of low order components used by Naive Bayes. That is, for every 
class $y_j$, let $L = \{P(y_j), P(a_1 \mid y_j), P(a_2 \mid y_j), \ldots, P(a_n \mid y_j)\}$. Let $\phi'_{L\max}$ be the 
product approximation corresponding to Naive Bayes, and let $\phi'_L$ be any other 
approximation different from Naive Bayes that reduces to the same set of low 
order components in $L$. Then irrespective of the nature of $\phi$, it is always true 
that $\phi'_L$ contains more (or equal) information than $\phi'_{L\max}$. 

What is the implication behind the product approximation of Naive Bayes? In 
brief, such approximation tries to reconstruct the true distribution from the set of 
low order components assuming as little additional information as possible; hence 
the distribution is maximally flat. Naive Bayes displays a homogeneous class 
distribution on all regions of examples for which the set of low order components 
is identical. 

As an illustration, Figure 1-left shows an input-output distribution on two 
classes: positive ($y_1 = +$) and negative ($y_2 = -$). We assume a two-dimensional 
space where attribute $A_1$ takes on three values, and attribute $A_2$ takes on two 
values. Since we have equal class proportions ($P(+) = P(-) = \frac{1}{2}$), the 
classification depends on the likelihoods only. Figure 1-right shows the approximation 
made by Naive Bayes. The product approximation tends to smooth all 
probabilities. According to Naive Bayes the distribution is now the same along $A_2 = 1$, 
with a likelihood ratio in favor of the negative class, and along $A_2 = 2$, with a 
likelihood ratio in favor of the positive class. 

Consider example $\mathbf{x} = (A_1 = 2, A_2 = 1)$ as shown in Figure 1. The Bayes (optimal) 
classifier assigns $\mathbf{x}$ to the positive class (Figure 1-left). The situation changes 
completely for Naive Bayes (Figure 1-right). Since $P(A_1 = 2 \mid +) = P(A_1 = 2 \mid -)$, 
the classification for $\mathbf{x}$ hinges on $P(A_2 = 1 \mid y)$ exclusively; Naive Bayes 
assigns $\mathbf{x}$ to the negative class because $P(A_2 = 1 \mid -) > P(A_2 = 1 \mid +)$. The 
mistake incurred by Naive Bayes stems from the assumption behind a maximal 
entropy distribution. The existence of regions that are class uniform is blurred 
by Naive Bayes’s vision; these regions are simply averaged altogether when 
projected onto each attribute. 
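
The effect can be reproduced with exact arithmetic on a small distribution. The cell probabilities below are hypothetical: they are chosen only to satisfy the relations stated in the text (equal priors, equal marginals for $A_1 = 2$, and a larger marginal for $A_2 = 1$ in the negative class) and are not the values of Figure 1.

```python
from fractions import Fraction as F

# Hypothetical class-conditional distributions P(A1, A2 | class);
# illustrative values, not those of Figure 1.
cond = {
    "+": {(2, 1): F(1, 3), (1, 2): F(1, 3), (3, 2): F(1, 3)},
    "-": {(1, 1): F(1, 3), (3, 1): F(1, 3), (2, 2): F(1, 3)},
}
prior = {"+": F(1, 2), "-": F(1, 2)}

def marginal(cls, index, value):
    """P(A_index = value | cls), obtained by summing out the other attribute."""
    return sum(p for cell, p in cond[cls].items() if cell[index] == value)

def bayes_score(x, cls):
    """Prior times the true class-conditional probability of the cell."""
    return prior[cls] * cond[cls].get(x, F(0))

def naive_bayes_score(x, cls):
    """Prior times the product of the two marginals."""
    return prior[cls] * marginal(cls, 0, x[0]) * marginal(cls, 1, x[1])

x = (2, 1)
bayes_choice = max(prior, key=lambda c: bayes_score(x, c))
naive_choice = max(prior, key=lambda c: naive_bayes_score(x, c))
```

With these values the Bayes classifier picks the positive class for `x`, whereas the product of marginals favors the negative class (1/9 against 1/18), because the class-uniform cells are averaged away in the marginals.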

3.3 The Class-Dispersion Problem 

The problem we are addressing is characteristic of distributions where clusters 
of examples that belong to the same class are dispersed throughout the input 
space. We call this the class-dispersion problem. In this case, clusters are hard 
to identify because a single-dimensional projection of the data loses their spatial 
information. This is related to the small disjunct problem in classification [6], 
where the existence of many small disjuncts (i.e., class-uniform clusters covering 
Fig. 1. (left) The true distribution of examples in the input-output space, (right) The 
maximum-entropy approximation made by Naive Bayes. Example x is incorrectly clas- 
sified by Naive Bayes. 



few examples) may account for a significant amount of the total error rate. Our 
focus, however, is based on the distribution of clusters rather than their coverage. 

Our intuition is that Naive Bayes may perform better on domains where the 
examples of one class are clustered together. This intuition has some theoreti- 
cal justification. For example, a Boolean target function made of a disjunction 
(conjunction) of all attributes (or their negations) has only a single example of 
class 0 (1 for conjunction) and yields optimal (error-free) performance for Naive 
Bayes [2]. The optimality of Naive Bayes can be easily proven for a more general 
case of two-class problems where one of the classes is assigned to a single point 
[11], but the attributes are nominal rather than Boolean. We have extended this 
result showing that probability distributions having almost all the probability 
mass concentrated in one example are well approximated through a product 
distribution (see [11] for proof): 

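Theorem 1. If for some $0 < \delta < 1$, $\exists\, \mathbf{x}^* = (a_1^*, \ldots, a_n^*)$ such that 
$P(a_1^*, \ldots, a_n^* \mid y_j) > 1 - \delta$, then $\forall \mathbf{x} = (a_1, \ldots, a_n)$, 
$|P(\mathbf{x} \mid y_j) - \prod_{i=1}^{n} P(a_i \mid y_j)| < n\delta$. 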

In these cases, although a product approximation does not guarantee good 
performance of Naive Bayes, it makes it more likely in practice. Nevertheless, as 
the target distribution changes such that each class groups into multiple clusters, 
the chances of misclassifications incurred by Naive Bayes increase greatly. This 
is because Naive Bayes tends to smooth out the class-conditional probabilities. 
In cases when instances of the same class are scattered, computing marginals 
(i.e. single-dimensional projections) of the data may result in significant loss of 
information. 
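
The bound of Theorem 1 can be checked numerically on a concentrated distribution. The sketch below is only an illustration: it assumes binary attributes, builds a hypothetical class-conditional distribution with almost all mass on one point, and measures the largest gap between the joint probability and the product of its marginals. The function names are assumptions of this example.

```python
import random
from itertools import product

def max_product_gap(joint, n):
    """Largest |P(x) - prod_i P(a_i)| over all x, for n binary attributes."""
    marg = [[sum(p for x, p in joint.items() if x[i] == v) for v in (0, 1)]
            for i in range(n)]
    gap = 0.0
    for x, p in joint.items():
        prod_marg = 1.0
        for i, a in enumerate(x):
            prod_marg *= marg[i][a]
        gap = max(gap, abs(p - prod_marg))
    return gap

def concentrated_joint(n, delta, seed=0):
    """Hypothetical joint with mass 1 - delta/2 on the all-zeros point and
    the remaining delta/2 spread randomly over all other points."""
    rng = random.Random(seed)
    points = list(product((0, 1), repeat=n))
    rest = [rng.random() for _ in points[1:]]
    scale = 0.5 * delta / sum(rest)
    joint = {x: r * scale for x, r in zip(points[1:], rest)}
    joint[points[0]] = 1.0 - 0.5 * delta
    return joint
```

For instance, with four attributes and a tolerance of 0.1, the measured gap stays below the bound 0.4, in line with the theorem.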



4 Decomposing Classes into Clusters 

Our solution to the class-dispersion problem can be summarized through a two- 
step process: 1) identify class-uniform clusters of examples in the training set, 
and 2) relabel each cluster as a new class of examples. The new dataset differs 
from the original training set in the class labelling: there is now an additional 
Algorithm 1: Mapping-Process 
Input: clustering method C, dataset T 
Output: new dataset T' 

Mapping-Process(C, T) 
    Separate T into subsets {T_j}, 
        where T_j = {(x, y) ∈ T | y = y_j} 
    foreach T_j 
        Apply clustering C on T_j 
        Let {C_j^p} be the set of clusters 
        foreach example e = (x, y_j) 
            Let p be the cluster index for x 
            Create example e' = (x, y'_j), 
                where y'_j = (y_j, p) 
            Add e' to T' 
        end 
    end 
    return T' 

Fig. 2. The process to transform dataset T into a new dataset T' using a clustering 
algorithm. 



number of classes. Naive Bayes is then trained over the new dataset. During 
classification, performance can be assessed by simply assigning each example 
back to its original class. A general description of our approach follows. 



4.1 The Data Transformation 

Let $T = \{(\mathbf{x}, y)\}$ be the input dataset. Our first step is to map $T$ into another 
dataset $T'$ through a class-decomposition process. The mapping leaves the input 
space $\mathcal{X}$ intact but changes the output space $\mathcal{Y}$ into a (possibly) larger space $\mathcal{Y}'$ 
(i.e., $|\mathcal{Y}'| \geq |\mathcal{Y}|$, where $|\cdot|$ is the cardinality of the space). 

The second step is to train Naive Bayes on $T'$ to obtain hypothesis $h'$. The 
hypothesis acts over the transformed output space, $h' : \mathcal{X} \to \mathcal{Y}'$. The classification 
of a new input vector $\mathbf{x}$ is obtained by applying a function $g$ over $h'(\mathbf{x})$ that will 
essentially bring the class label back to the original output space, $g : \mathcal{Y}' \to \mathcal{Y}$. 



4.2 The Mapping Process 

The first step in the transformation process is shown in Algorithm 1 (Figure 2). 
We proceed by first separating dataset $T$ into sets of examples of the same class. 
That is, $T$ is separated into different sets of examples $T = \{T_j\}$, where each $T_j$ 
comprises all examples in $T$ labelled with class $y_j$, $T_j = \{(\mathbf{x}, y) \in T \mid y = y_j\}$. 

For each set $T_j$ we apply a clustering algorithm $C$ to find sets of examples 
(i.e., clusters) grouped together according to some distance metric over the input 
space. Let $\{C_j^p\}$ be the set of such clusters. We map the set of examples in $T_j$ 
into a new set $T'_j$ by renaming every class label to indicate not only the class 
but also the cluster to which each example belongs. One simple way to do this 
Fig. 3. The mapping process relabels examples to encode both class and cluster. 



is by making each class label a pair $(a, b)$, where the first element represents the 
original class and the second element represents the cluster that the example 
falls into. In that case, $T'_j = \{(\mathbf{x}, y')\}$, where $y' = (y_j, p)$ whenever example $\mathbf{x}$ is 
assigned to cluster $C_j^p$. 

An illustration of the transformation above is shown in Figure 3. We assume 
a two-dimensional input space where examples belong to either class A or B. 
Let’s suppose the clustering algorithm separates class A into two clusters, while 
class B is grouped into one single cluster. The transformation relabels every 
example to encode class and cluster label. As a result, dataset T' has now three 
different classes. 
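
The mapping process can be sketched in Python as follows. The clustering method is passed in as a callable, as in Algorithm 1; the toy one-dimensional clustering routine `threshold_clusters` and its `gap` parameter are hypothetical stand-ins for whatever clustering algorithm is actually used.

```python
from collections import defaultdict

def mapping_process(cluster, dataset):
    """Relabel each example with a (class, cluster index) pair.

    `cluster` is any callable that takes a list of attribute vectors and
    returns one cluster index per vector; `dataset` is a list of (x, y) pairs.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)            # separate the dataset by class
    new_dataset = []
    for y, xs in by_class.items():
        for x, p in zip(xs, cluster(xs)):
            new_dataset.append((x, (y, p)))   # new label encodes class and cluster
    return new_dataset

def threshold_clusters(xs, gap=2.0):
    """Toy 1-D clustering (illustrative assumption): sort by the first
    attribute and start a new cluster whenever two consecutive values
    differ by more than `gap`."""
    order = sorted(range(len(xs)), key=lambda i: xs[i][0])
    labels, current = [0] * len(xs), 0
    for prev, nxt in zip(order, order[1:]):
        if xs[nxt][0] - xs[prev][0] > gap:
            current += 1
        labels[nxt] = current
    return labels
```

Applied to a dataset where class A forms two well-separated groups and class B a single group, the result contains three distinct labels, as in the illustration above.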

Finally, the new dataset $T'$ is simply the union of all sets of examples of the 
same class relabelled according to the cluster to which each example belongs, 
$T' = \bigcup_{j=1}^{k} T'_j$. 

4.3 The Classification Process 

During the second step, Naive Bayes is trained over the new dataset T' producing 
a hypothesis h' mapping points from input space X to the new output space y' . 
Each discriminant function has the same form as Equation 3, but the number 
of discriminant functions is now (possibly) larger, according to how much the 
decomposition process divided up each class into multiple clusters. 

When classifying a new input vector x, hypothesis h' will output a prediction
consisting of a class label and a cluster label, $h'(x) = (y_j, p)$, corresponding to
original class $y_j$ and cluster $C_p^j$. To obtain the actual prediction in the original
output space Y, we simply apply a function g that removes the second element
of the pair, $g(y_j, p) = y_j$. Essentially, we predict class label $y_j$ whenever example
x is assigned to any of the clusters of class $y_j$.
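
The projection back to the original output space is a one-line function. The sketch below assumes that the classifier trained on the decomposed dataset is available as a callable; `toy_h_prime` is a hand-written stand-in for such a classifier, not a trained Naive Bayes model.

```python
def g(decomposed_label):
    """Remove the cluster index from a (class, cluster) pair."""
    original_class, _cluster = decomposed_label
    return original_class


def predict_original(h_prime, x):
    """Prediction in the original output space from a classifier over T'."""
    return g(h_prime(x))


def toy_h_prime(x):
    # Hypothetical decision rule over the decomposed labels (demo only).
    if x[1] > 1.0:
        return ("B", 0)
    return ("A", 0) if x[0] < 0 else ("A", 1)
```

Both clusters of class A map back to the same prediction, so the caller never sees the cluster index.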

The decomposition process aims at eliminating the cases where a class spreads 
out into multiple regions. As each cluster is transformed into a class of its own, 
the class-dispersion problem vanishes. The result is a new input-output space 
where each class sits in a tight region. By reducing the class-dispersion problem, 
the conditional probabilities estimated by Naive Bayes better conform with the 
assumption of a product distribution (i.e., of a maximum-entropy distribution).

5 Locality, Capacity, and the Class Decomposition Process

A better understanding of our approach can be gained by looking at the dif- 
ference between locality, capacity, and the class-decomposition process. Naive 
Bayes is a global classifier: it makes use of all available data to estimate its 
parameters (i.e., a priori and class-conditional probabilities). As such it fails to 
detect local class variations given the same set of low order components. Failing 
to detect those variations is a byproduct of the attribute-independence assump- 
tion; Naive Bayes is a learning machine with low capacity (i.e., low flexibility 
in the decision boundaries). To solve this problem, one may introduce a form of
locality in the global classifier, in which parameters are estimated based only on 
the neighborhood of the example x being classified. 

The class decomposition process discussed in Section 4 introduces an alter- 
native view to local classification: instead of focusing on local regions of the 
input space, we can augment the number of discriminant functions according to 
the class distribution. That is, one can add more decision boundaries but retain 
their low flexibility. By augmenting the number of discriminant functions, the 
capacity of the algorithm is in fact increased, but the flexibility of the boundaries 
remains the same. The trick lies in the clustering phase that in fact pre-identifies 
local structures in the data. In addition, separating classes into clusters simply 
reduces the dependencies between attributes, but retains all examples for anal- 
ysis. The class-cluster encoding computed during the transformation process 
(Figure 3) does not result in a loss of information with respect to the original 
sample distribution. 

6 Experiments 

We now report on a series of experiments that compare Naive Bayes (NB) with a 
modified version (NB') that computes the transformation described in Section 4. 
Our datasets (26 domains) can be obtained from the University of California at 
Irvine repository [1]. In what follows, predictive accuracy on each dataset is
obtained using stratified 10-fold cross-validation, averaged over 5 repetitions; tests
of significance use a two-tailed Student's t-test. The clustering algorithm
follows the Expectation Maximization (EM) technique [10]; it groups examples 
into clusters by modelling each cluster through a probability density function. 
Each example in the dataset has a probability of class membership and is as- 
signed to the cluster with highest posterior probability. The number of clusters 
is estimated using cross-validation. Implementations of Naive Bayes and EM are 
part of the WEKA machine-learning class library [13], set with default values. 
Runs were performed on a RISC/6000 IBM model 7043-140. 
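
The evaluation protocol described above (stratified 10-fold cross-validation repeated 5 times) can be sketched as follows. This is an illustration under stated assumptions, not the WEKA procedure used in the experiments: `stratified_folds`, `repeated_cv_accuracy`, and the `train_and_score` callback are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the examples of each class round-robin over the folds.
        for j, i in enumerate(indices):
            folds[(offset + j) % k].append(i)
        offset += len(indices)
    return folds


def repeated_cv_accuracy(train_and_score, labels, k=10, repetitions=5):
    """Average the score returned by train_and_score(train_idx, test_idx)."""
    scores = []
    for rep in range(repetitions):
        for test_idx in stratified_folds(labels, k, seed=rep):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)
```

The callback receives index lists only, so the same loop can evaluate both the standard classifier and the variant trained on relabelled data.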

Table 1 displays our results. The first column describes the domains used for
our experiments. The second and third columns report on the accuracy of Naive
Bayes; the second column corresponds to the standard version and the third
column to the version using the transformation described in Section 4 (numbers
enclosed in parentheses represent standard deviations). The fourth column shows
the improvement in accuracy that comes with our proposed approach (an asterisk
at the top right of each number implies the difference is significant at the p =
0.01 level). The last columns show average values for the mean, minimum, and
maximum number of clusters per class for every dataset.

Table 1. Predictive accuracy on real-world domains for Naive Bayes with and
without the transformation process. Numbers enclosed in parentheses represent
standard deviations.

                  Naive Bayes     Naive Bayes with                   Number of Clusters
Domain            NB              transformation NB'   Δ Accuracy    Mean    Min     Max
Anneal            86.64 (0.06)    96.48 (0.09)          9.84*        3.4     1.0     5.0
Audiology         72.17 (0.20)    72.17 (0.20)          0.00         1.0     1.0     1.0
Autos             57.89 (0.25)    71.83 (0.80)         13.94*        2.66    1.0     5.0
Balance-Scale     90.48 (0.05)    90.48 (0.05)          0.00         1.0     1.0     1.0
Breast-Cancer     73.30 (0.15)    73.72 (0.26)          0.42         2.5     2.0     3.0
Breast-W          95.98 (0.01)    95.98 (0.01)          0.00         1.0     1.0     1.0
Colic             78.50 (0.17)    78.50 (0.17)          0.00         1.0     1.0     1.0
Credit-G          75.05 (0.27)    73.26 (0.13)         -1.79         4.0     3.0     5.0
Diabetes          75.49 (0.07)    74.98 (0.10)         -0.51         3.0     1.0     5.0
Heart-C           83.18 (0.07)    84.16 (0.09)          0.98*        2.0     1.0     3.0
Heart-H           84.22 (0.24)    83.52 (0.12)         -0.70         4.0     2.0     6.0
Heart-Statlog     84.28 (0.15)    84.28 (0.15)          0.00         1.0     1.0     1.0
Hepatitis         84.24 (0.25)    85.71 (0.13)          1.47         2.5     1.0     4.0
Ionosphere        82.25 (0.13)    90.12 (0.15)          7.87*        4.5     1.0     8.0
Iris              95.36 (0.07)    95.96 (0.37)          0.60         3.3     2.0     5.0
Chess             87.18 (0.35)    90.62 (0.06)          3.44*        9.5     9.0    10.0
Labor             94.08 (0.42)    94.08 (0.42)          0.00         1.0     1.0     1.0
Lymph             83.79 (0.18)    83.79 (0.18)          0.00         1.0     1.0     1.0
Mushroom          94.01 (0.23)    99.83 (0.01)          5.82*        5.5     5.0     6.0
Tumor             51.20 (0.21)    51.20 (0.21)          0.00         1.0     1.0     1.0
Segment           79.73 (0.09)    87.89 (0.68)          8.16*        4.57    1.0    11.0
Sick              93.84 (0.39)    98.70 (0.04)          4.86*        8.5     6.0    11.0
Vehicle           44.96 (0.17)    73.73 (0.03)         28.77*        8.0     6.0    10.0
Vote              90.07 (0.03)    95.60 (0.13)          5.53*        2.5     2.0     3.0
Vowel             63.79 (0.15)    92.05 (0.11)         28.26*        6.3     4.0     9.0
Zoo               94.92 (0.16)    97.02 (0.00)          2.10*        1.28    1.0     2.0
Average           80.63           85.22                 4.58         3.42    2.19    4.80

Our results show how the transformation process improves the accuracy of 
Naive Bayes in most of the datasets used for our experiments. Where no im- 
provement is observed the difference is not statistically significant; in the ex- 
treme case where each class is grouped into one single cluster, the performance 
of our proposed approach is identical to Naive Bayes. In some other domains,
the improvement reaches approximately 28 percentage points (e.g., Vehicle and Vowel).
The average improvement in accuracy is approximately 4.5 percentage points. Figure 4
(left) shows the difference between our approach (NB') and Naive Bayes (NB) 
(y-axis) where domains are ordered according to the mean number of clusters 
per class (x-axis). Most significant differences correspond to domains with many 
clusters per class (we note the increase is not monotonic).

In addition, our results shed some light on the competitiveness of Naive
Bayes in real-world domains. Figure 4 (right) shows a histogram of the mean 
number of clusters per class for each dataset. Most datasets exhibit a distribution 
characterized by few clusters per class, a situation that favors the assumption 
behind a product distribution. Few datasets exhibit many clusters per class, 
which explains why Naive Bayes often appears at the same level of performance 
as other (more sophisticated) algorithms. 




[Figure 4. Left panel x-axis: "Domains Ordered by Mean Number of Clusters".]

Fig. 4. (left) Accuracy difference between Naive Bayes with the transformation and
standard Naive Bayes. (right) A histogram of domains based on the mean number of
clusters per class.



7 Summary and Future Work 

We propose a method to improve the probability estimates made by Naive Bayes 
by applying a clustering algorithm to each subset of class-uniform examples; the 
result is a new output space where each cluster is assigned a new class label. Our 
experimental analysis shows a significant improvement in performance when the 
class decomposition process is applied, especially when the mean number of clus- 
ters per class is large. The competitiveness of Naive Bayes reported in previous 
work is explained by the fact that many real-world datasets decompose into few 
clusters per class, a situation that favors the product distribution assumption 
followed by Naive Bayes. 

Our study assumes an effective clustering algorithm in charge of the class 
decomposition process. The choice of the clustering algorithm bears relevance to 
the effectiveness of our approach; future work will explore if our results hold for 
different clustering algorithms. In addition, we note that the parameters of the 
clustering algorithm can be adjusted based on the performance of Naive Bayes 
(e.g., by varying the number of clusters). 

Finally, since our proposed approach does not alter the algorithm design, it 
can be employed outside the boundaries of Naive Bayes, serving as a framework 
to improve the performance of classifiers that exhibit poor performance when
the dataset is characterized by many clusters per class, as is the case with linear
classifiers. We plan to address this in future work; our goal is to understand how 
the class-decomposition process can serve as a general framework to improve 
classification performance. 

References 

1. Blake C.L., Merz C.J.: UCI Repository of Machine Learning Databases. University
of California, Irvine, Dept. of Information and Computer Sciences (1998).
http://www.ics.uci.edu/~mlearn/MLRepository.html.

2. Domingos P., Pazzani M.: On the Optimality of the Simple Bayesian Classifier 
Under Zero-One Loss. Machine Learning 29, pp. 103-130 (1997). 

3. Duda R. O., Hart P. E., Stork D. G.: Pattern Classification. John Wiley Ed. 2nd 
Edition (2001). 

4. Friedman N., Geiger D., Goldszmidt M.: Bayesian Network Classifiers. Machine
Learning 29, pp. 131-163 (1997). 

5. Garg A., Roth D.: Understanding Probabilistic Classifiers. European Conference 
on Machine Learning, Lecture Notes in Artificial Intelligence, pp. 179-191 (2001). 

6. Holte R.C., Acker L.E., Porter B.W.: Concept Learning and the Problem of Small 
Disjuncts. Eleventh International Joint Conference on Artificial Intelligence, Mor- 
gan Kaufmann, pp. 813-818 (1989). 

7. Kohavi R., Becker B., Sommerfield D.: Improving Simple Bayes. European Con- 
ference on Machine Learning (1997). 

8. Kohavi R.: Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree
Hybrid. International Conference on Knowledge Discovery and Data Mining (1996).

9. Lewis P.M.: Approximating Probability Distributions to Reduce Storage Require- 
ments. Information and Control, 2, pp. 214-225 (1959). 

10. McLachlan G., Krishnan T.: The EM Algorithm and Extensions. John Wiley and 
Sons (1997). 

11. Rish I., Hellerstein J., Jayram T.: An Analysis of Naive Bayes on Low-Entropy
Distributions. IBM T.J. Watson Research Center, RC91994 (2001). 

12. Webb G. I., Pazzani M. J.: Adjusted Probability Naive Bayes Induction. Tenth
Australian Joint Conference on Artificial Intelligence. Springer-Verlag, pp. 285-295
(1998).

13. Witten I. H., Frank E.: Data Mining: Practical Machine Learning Tools and Tech- 
niques with Java Implementations. Academic Press, London U.K. (2000). 

14. Zadrozny B., Elkan C.: Obtaining Calibrated Probability Estimates From Decision
Trees and Naive Bayesian Classifiers. International Conference on Machine
Learning (2001).

15. Zhang H., Ling C. X.: Geometric Properties of Naive Bayes in Nominal Domains. 
European Conference on Machine Learning, pp. 588-599 (2001). 




Improving Rocchio 
with Weakly Supervised Clustering 



Romain Vinot and François Yvon

GET/ENST, 46 rue Barrault,
75634 Paris Cedex, France
{romain.vinot,francois.yvon}@enst.fr



Abstract. This paper presents a novel approach for adapting the com- 
plexity of a text categorization system to the difficulty of the task. In 
this study, we adapt a simple text classifier (Rocchio), using weakly su- 
pervised clustering techniques. The idea is to identify sub-topics of the 
original classes which can help improve the categorization process. To 
this end, we propose several clustering algorithms, and report results of 
various evaluations on standard benchmark corpora such as the News- 
groups corpus. 



1 Introduction 

The automated categorization of documents [16] into predefined classes has
progressively emerged as one of the most popular tasks in the area of Text Mining
technologies. The categorization paradigm provides a very general framework
for many practical applications such as filtering, routing, indexing, and tracking.
Historically, research in automated text categorization was developed and promoted
by the Information Retrieval community, in the context of text indexing. In
the past five to ten years, this task has been rediscovered by the Machine Learn- 
ing community and has since been the subject of many empirical studies [20], 
demonstrating the applicability and efficiency of Machine Learning algorithms 
such as Support Vector Machines [6] and Boosting techniques [14]. 

The Rocchio algorithm [12], originally proposed as a means to improve
information retrieval (IR) systems, is conceptually one of the simplest text
categorization algorithms. As such, it has been shown to be, in many experimental
conditions, less successful than other approaches, such as SVMs or k-NN.
Various improvements of this algorithm have been proposed, e.g., in [17], [1], [15], or
[9], which have been effective in increasing the performance of this methodology.

As discussed for instance in [19], Rocchio is especially well suited to
applications where (i) the number of classes is high; (ii) the n-best answers
(rather than just the first one) are taken into account and (iii) class labels are 
noisy. This seems to happen in practical applications, especially when categories 
cannot be directly linked with the thematic content of documents. Conversely, 
we have experimentally demonstrated that classes with heterogeneous content 
can badly impair its performance. This behaviour is in line with the simplicity 
of the underlying statistical model and of the learning procedure.

In this paper, we investigate several procedures allowing Rocchio to
automatically adapt its complexity to the data using weakly supervised clustering.
The idea is to identify useful subclasses in the training data, and to use these 
to refine the decision surface; each new subclass increasing the overall model 
complexity. The main novelty of this work lies in our attempt to discover these 
subclasses in such a way that the resulting partition actually improves the final 
decision rule. 

The idea of using unsupervised clustering to improve supervised classifica- 
tion is not entirely new, and has already been suggested in several contexts. In 
the context of the k-nearest neighbors algorithm, many authors have advocated
unsupervised clustering as a means to organise very large instance sets, thus
speeding up the k-NN computation.

In the context of the (TREC) batch filtering task, [4] and [11] incrementally 
cluster incoming data into subclasses of the original classes. These clusters are 
then used as (new) regular class labels: any document falling into cluster S of
class C is eventually labelled as C. The two papers use rather different classification
algorithms (Rocchio for the former paper, and SVM for the latter), but neither
yields fully conclusive results. In any case, this optimisation does not seem to
increase overall performance. One should note that in this approach, clustering 
and classification are viewed as two separate and unrelated processes. 

The motivations of [3] for using unsupervised clustering are clearer: these
authors explicitly aim at structuring a heterogeneous corpus of training texts,
in the context of a filtering task. Relevant texts are first partitioned into N
clusters, using AutoClass [2]. Test documents are then classified as follows: for
each subclass a relevance judgement is separately computed; these judgements 
are then linearly combined, using a perceptron. This procedure seems to provide 
the authors an effective means to isolate those clusters which provide the most 
relevant judgements. 

The idea of [7] is to use clustering techniques in order to make the k-NN
decision rule less sensitive to noisy data, and more like a linear classifier.
Proceeding bottom-up, their algorithm recursively aggregates training instances,
subject to the condition that (i) they belong to the same class and (ii) they are
sufficiently close in the representational space. Test documents are then
classified according to the following two-step procedure: first compute the similarity
with each cluster using Rocchio or the Widrow-Hoff rule; then linearly combine
these similarities to compute the final label. This procedure provides a significant
improvement over both a “pure” k-NN approach and a “pure” linear decision.

This paper is organised as follows: in Section 2 we present the Rocchio algo- 
rithm and discuss its main advantages and drawbacks. Section 3 explains how 
this baseline is improved with a weakly supervised clustering procedure. We 
describe two algorithms which aim at creating such clusters which may prove 
beneficial for the decision procedure. Section 4 reports and discusses experimen- 
tal results obtained on 3 different textual databases and Section 5 presents some 
conclusions and directions for future work.

2 Rocchio



Rocchio is a text classifier originally introduced in [12] to improve information
retrieval systems with relevance feedback. It uses the vector space model [13]:
every text d is represented by a vector [d] in $\mathbb{R}^n$ (with n the number of distinct
words in the corpus); each coordinate $d_w$ can be computed from the frequency
fr(w, d) of word w in d in several ways. One of the most commonly used formulations is:

$$d_w = \mathrm{TFIDF}(w, d) = \log(1 + fr(w, d)) \cdot \log\left(\frac{N}{N(w)}\right) \qquad (1)$$

where N is the number of documents in the corpus and N(w) the number of
documents in which w occurs at least once. This formula allocates a higher
weight to words which occur frequently in the document while
occurring in only a few documents. Each vector [d] is then normalized according
to $d_w \leftarrow d_w / \sqrt{\sum_{w' \in [d]} d_{w'}^2}$, so as to reduce the distortion caused by the length of
documents. A prototypical profile [c] is computed for each class c according to

$$[c] = t \cdot \frac{1}{N_c} \sum_{d \in c} [d] \; - \; (1 - t) \cdot \frac{1}{N_{\bar{c}}} \sum_{d \notin c} [d] \qquad (2)$$

with $N_c$ the number of documents in c, $N_{\bar{c}}$ the number of documents not in c and
t a free parameter between 0 and 1. These profiles are defined as the centroid of
the examples (with a positive coefficient for examples in the class and a negative 
one for the others). These vectors are also normalized. In the context of a routing 
task, t = 1 is a usual choice, meaning that in this case, negative examples do 
not contribute to the centroids. 

Classification of new documents is performed by computing the Euclidean
distance between the document vector and the prototype vector of each class;
the document is then assigned to the nearest class. As all vectors (prototypes and
documents) are normalized, ranking classes by Euclidean distance is equivalent to
ranking them by cosine similarity or dot product, which is used for implementation reasons.
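
The pipeline described in this section can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it covers only the special case t = 1 (negative examples do not contribute), documents are assumed to be already tokenized, and all function names (`fit_idf`, `vectorize`, `train_rocchio`, `classify`) are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Second factor of Eq. (1), log(N / N(w)), for every training word."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length (zero vectors unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else dict(vec)


def vectorize(doc, idf):
    """TF-IDF vector of a tokenized document; unseen words are ignored."""
    counts = Counter(w for w in doc if w in idf)
    return normalize({w: math.log(1 + f) * idf[w] for w, f in counts.items()})


def train_rocchio(vectors, labels):
    """One normalized prototype per class, for the special case t = 1."""
    sums = {}
    for vec, y in zip(vectors, labels):
        acc = sums.setdefault(y, {})
        for w, v in vec.items():
            acc[w] = acc.get(w, 0.0) + v
    # Dividing by the class size is unnecessary: normalization removes the scale.
    return {y: normalize(total) for y, total in sums.items()}


def dot(a, b):
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def classify(prototypes, vec):
    """Assign the class whose prototype has the highest dot product with vec."""
    return max(prototypes, key=lambda y: dot(prototypes[y], vec))


train_docs = [
    ["cheap", "software", "offer"],
    ["software", "update", "offer"],
    ["meeting", "agenda", "monday"],
    ["agenda", "minutes", "meeting"],
]
train_labels = ["spam", "spam", "ham", "ham"]

idf = fit_idf(train_docs)
prototypes = train_rocchio([vectorize(d, idf) for d in train_docs], train_labels)
prediction = classify(prototypes, vectorize(["software", "offer", "today"], idf))
```

Because every vector has unit length, taking the maximum dot product selects the same class as taking the minimum Euclidean distance.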

In the context of this paper, Rocchio exhibits two important characteristics:

— The decision rule computed by Rocchio is a linear separator (hyperplane) in 
the vector space. It gives Rocchio the same expressivity as a Perceptron clas- 
sifier. As [8] shows on filtering tasks, the accuracy of Rocchio with dynamic 
feedback is comparable to that of a neural network trained with a gradient 
descent algorithm. 

— The learning model of Rocchio is a generative one. Parameters are opti- 
mized to match the data, not to discriminate the different classes. Generative 
models usually have lower asymptotic performance value than discriminative 
ones, but they converge to their optimal performance faster, especially when
there are many parameters to estimate (which is the case in the textual
classification domain) [10].

We have shown in [19] that Rocchio is robust in the presence of noise and 
very effective for routing tasks with a high number of classes. On the other
hand, its inability to take into account the substructure of classes (such as
different subtopics) is a significant drawback compared to other approaches.
Rocchio vastly underperforms other algorithms when classes contain intermixed
subtopics, that is, when some subtopics of different classes are nearer than subtopics of
the same class (for example, in the case of spam filtering, the subtopic of software
advertisement can be closer to legitimate emails than to other spam subtopics,
such as messages with pornographic content). In the remainder of this paper,
we will call classes with such intermixed subtopics “heterogeneous classes”. Fig- 
ure 1 shows an example of homogeneous and heterogeneous classes in terms of 
relative positioning of subtopics in the vector space. 



[Figure 1: two panels, "Homogeneous classes" and "Heterogeneous classes", each showing round and square markers for the subtopics of the two classes.]

Fig. 1. Homogeneous and heterogeneous corpus (round and square classes have two 
subtopics). 



These difficulties in dealing with classes with intermixed subtopics come from 
the generative model used by Rocchio. It assumes that each class has a spherical 
shape and that the information of the centroid is sufficient to correctly describe 
a class. Training is accordingly straightforward, because one only has to compute 
the centroid of all examples for each class. But in cases where the data does not 
match this model, Rocchio will perform poorly. 

To avoid these shortcomings, one needs to allow Rocchio to use more complex
models while at the same time preserving its simple learning process. This can 
be achieved with the use of a clustering algorithm which will split the classes 
into coherent sub-classes. A prototype is then computed for each cluster, as if it 
were a class in its own right. New examples are labelled according to the class of 
the nearest prototype. We call this class of algorithms Multi-Prototypes Rocchio 
(MPR). Even if this procedure can be used with any classifier, it is especially 
tailored to avoid Rocchio’s shortcomings. 
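
The MPR decision rule can be sketched as follows, assuming that the clusters are already given and that documents are dense unit vectors. The names `train_mpr` and `classify_mpr` and the two-dimensional toy data are illustrative assumptions, not part of the original system.

```python
import math


def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prototype(vectors):
    """Normalized centroid of a list of vectors."""
    return unit(tuple(sum(col) / len(vectors) for col in zip(*vectors)))


def train_mpr(clusters):
    """clusters: list of (class_label, vectors); one prototype per cluster."""
    return [(label, prototype(vectors)) for label, vectors in clusters]


def classify_mpr(model, x):
    """Return the class of the prototype most similar to x."""
    label, _ = max(model, key=lambda entry: dot(entry[1], x))
    return label


# The "round" class has two subtopics lying on either side of the "square" class.
round_a = [unit((1.0, 0.05)), unit((1.0, -0.05))]
round_b = [unit((0.05, 1.0)), unit((-0.05, 1.0))]
square = [unit((1.0, 0.9)), unit((0.9, 1.0))]

single = train_mpr([("round", round_a + round_b), ("square", square)])
multi = train_mpr([("round", round_a), ("round", round_b), ("square", square)])
query = unit((1.0, 0.1))
```

With a single prototype per class, the "round" and "square" prototypes point in almost the same direction, so the two classes can hardly be told apart; with one prototype per cluster, `query` is assigned to "round".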

Rocchio can be seen as a neural network with no hidden layer. Weights are not
learned by a back-propagation algorithm but simply computed as the mean of
weights of all examples. Clusters are analogous to a hidden layer having as many
neurons as clusters. Weights of connections between the initial layer and the hidden
layer are still computed as the mean of examples; weights between the hidden layer and
the final layer are binary values according to the class of the cluster. Propagation
of weights is different in MPR, because classification is based on the nearest
prototype (the hidden neuron with the highest weight) instead of a weighted
sum of all hidden neurons. With this analogy, we see that the clustering allows us to
change the complexity of the underlying model.

Moreover, preliminary experiments [19] show that the use of subclasses can be
very useful for heterogeneous classes but not for homogeneous ones. The model 
complexity (in other words number of clusters) must not be chosen according to 
the data but driven by the accuracy of Rocchio. There is no need to separate two 
subtopics of the same class if they are not mixed with any subtopic of another 
class. This suggests that clustering must be performed using a discriminative 
approach rather than a generative one. 

Most of the clustering algorithms described in the introduction do not meet these
requirements, because the clustering process is independent of the categorization
one. They do not try to directly optimize the accuracy of the induced classifier
(with the exception of the GIS algorithm from Lam).

3 Clustering Algorithms 

We want to devise a clustering algorithm which detects clusters in heterogeneous 
classes but not those in homogeneous classes. This condition can be turned into a 
mathematical criterion in various ways. We have explored two different criteria: 
the first one is based on the relative positioning of clusters, the second one is more 
directly error-driven and based on example categorization. Both are integrated 
into a top-down hierarchical clustering (at each step, we split one cluster into 
two new smaller clusters). As both criteria use the class labels of examples, this
procedure is not entirely unsupervised, hence the term “weakly supervised”.

3.1 Notations 

Let $C = \{c_1, \ldots, c_k\}$ be a set of k classes, N the number of examples and
$\Pi = \{\Pi_1, \ldots, \Pi_p\}$ the unknown partition of examples. c(x) is the class of example x,
$\Pi(x)$ the cluster containing x, and $c(\Pi_i)$ the class label of the examples in $\Pi_i$.
Clusters are only allowed to split the existing classes. The partition must then verify:

$$\forall (x, y), \quad \Pi(x) = \Pi(y) \Rightarrow c(x) = c(y) \qquad (3)$$

3.2 Top-Down Clustering

Our algorithm starts with one cluster per class. At each step, we want to split the
cluster with the largest dispersion of its documents. Following [5], we measure
the dispersion with the square root of the average pairwise similarity between
documents:

$$\|p\| = \sqrt{\frac{1}{|p|^2} \sum_{x, y \in p} [x] \cdot [y]} \qquad (4)$$

So we choose the cluster with the smallest norm and apply any clustering
algorithm to create two subclusters. We choose K-Means for its simplicity. The
resulting split is tested against the criterion. If the division is not accepted, we
try the cluster with the second smallest norm, and so on until no more splits
are accepted.
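
The reason a small norm signals a large dispersion can be checked numerically: assuming that similarity is the dot product and that self-pairs are included in the average, the square root of the average pairwise similarity equals the norm of the cluster centroid. The sketch below illustrates this identity on toy vectors; both function names are hypothetical.

```python
import math


def centroid_norm(vectors):
    """Euclidean norm of the mean vector of a cluster."""
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return math.sqrt(sum(m * m for m in mean))


def sqrt_avg_pairwise_similarity(vectors):
    """Square root of the mean dot product over all ordered pairs (self-pairs included)."""
    n = len(vectors)
    total = sum(sum(a * b for a, b in zip(u, v)) for u in vectors for v in vectors)
    return math.sqrt(total / (n * n))


tight = [(1.0, 0.0), (0.995, 0.1)]    # two nearly parallel documents
spread = [(1.0, 0.0), (0.0, 1.0)]     # two orthogonal documents
```

On these vectors the tight cluster has a norm close to 1, whereas the spread cluster has a norm of about 0.71, so the spread cluster would be selected for splitting first.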




Algorithm 1 Top-down clustering with criterion
Parameter: a criterion $\theta$
$\Pi \leftarrow$ initial partition with one cluster per class
$S = \emptyset$
while $S \neq \Pi$ do
    $p = \arg\min_{p \in \Pi - S} \|p\|$
    Apply 2-means on p. Let $p_1$ and $p_2$ be the two clusters.
    if $\theta$ is verified then {Split is accepted}
        $\Pi = \Pi - p + p_1 + p_2$
        $S = \emptyset$
    else
        $S = S + p$
    end if
end while
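
Algorithm 1 can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' code: documents are dense unit vectors, 2-means is a simple deterministic variant seeded with the first vector and the vector least similar to it, and the acceptance criterion is passed in as a hypothetical callable `criterion(partition, index, p1, p2)`.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(a / n for a in v) if n > 0 else v


def dispersion_norm(vectors):
    c = centroid(vectors)
    return math.sqrt(dot(c, c))


def two_means(vectors, iterations=20):
    """Deterministic 2-means; returns two lists, or None if one side is empty."""
    first = vectors[0]
    centers = [first, min(vectors, key=lambda v: dot(first, v))]
    groups = None
    for _ in range(iterations):
        groups = ([], [])
        for v in vectors:
            groups[0 if dot(v, centers[0]) >= dot(v, centers[1]) else 1].append(v)
        if not groups[0] or not groups[1]:
            return None
        new_centers = [unit(centroid(g)) for g in groups]
        if new_centers == centers:
            break
        centers = new_centers
    return groups


def top_down_clustering(classes, criterion):
    """classes: dict mapping a class label to its list of unit vectors."""
    partition = [(label, list(vectors)) for label, vectors in classes.items()]
    rejected = set()  # the set S of Algorithm 1, stored as indices into partition
    while len(rejected) < len(partition):
        candidates = [i for i in range(len(partition)) if i not in rejected]
        i = min(candidates, key=lambda j: dispersion_norm(partition[j][1]))
        label, members = partition[i]
        split = two_means(members) if len(members) > 1 else None
        if split is not None and criterion(partition, i, split[0], split[1]):
            partition[i:i + 1] = [(label, split[0]), (label, split[1])]
            rejected = set()  # a split was accepted: every cluster is a candidate again
        else:
            rejected.add(i)
    return partition


round_docs = [unit((1.0, 0.05)), unit((1.0, -0.05)), unit((0.05, 1.0)), unit((-0.05, 1.0))]
square_docs = [unit((1.0, 0.9)), unit((0.9, 1.0)), unit((1.0, 1.0))]

# Toy criterion standing in for RP or DC: accept a split only if both halves
# contain at least two documents.
result = top_down_clustering(
    {"round": round_docs, "square": square_docs},
    lambda partition, i, p1, p2: min(len(p1), len(p2)) >= 2,
)
```

In this toy run the dispersed "round" class is split once and the compact "square" class is left untouched.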



3.3 Criterion RP: With Relative Positioning of Clusters 



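$$\forall i \neq j, \quad c(\Pi_i) = c(\Pi_j) \Rightarrow \exists k, \; c(\Pi_k) \neq c(\Pi_i) \text{ and } \begin{cases} d(\Pi_i, \Pi_k) < d(\Pi_i, \Pi_j) \\ d(\Pi_j, \Pi_k) < d(\Pi_i, \Pi_j) \end{cases} \qquad (5)$$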



This constraint means that for each pair of clusters of the same class, there must 
exist a third cluster of another class between the two (see figure 2 for a small 
illustration). This constraint is obviously satisfied with one cluster per class. The 
criterion we used aims at maximizing the number of clusters, while keeping the 
constraint satisfied. 




Fig. 2. Illustration of the “zone between two clusters” . 



Given the highly combinatorial nature of the problem, we have used an
approximation, which relies on a greedy algorithm. At each step of the clustering
process, we only test whether the two newly constructed clusters satisfy our criterion.
The global constraint is not guaranteed because we do not verify it for all the
other pairs of clusters.
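
The greedy test applied to the two newly created clusters can be sketched as follows, assuming that each cluster is summarized by its prototype and that distances are Euclidean. `rp_accepts` is a hypothetical name, and the prototypes of the other classes are assumed to be supplied by the caller.

```python
import math


def rp_accepts(proto_1, proto_2, other_class_prototypes, distance=math.dist):
    """True if some prototype of another class is closer to BOTH new
    prototypes than they are to each other (the condition of constraint 5)."""
    d_12 = distance(proto_1, proto_2)
    return any(
        distance(proto_1, q) < d_12 and distance(proto_2, q) < d_12
        for q in other_class_prototypes
    )


# A foreign prototype lying between the two new clusters: the split is kept.
between = rp_accepts((1.0, 0.0), (0.0, 1.0), [(0.7071, 0.7071)])
# A foreign prototype far away on one side: the split is rejected.
outside = rp_accepts((1.0, 0.0), (0.0, 1.0), [(-1.0, 0.0)])
```

Only the new pair is tested, which keeps each step cheap but, as noted above, does not guarantee the global constraint.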

An additional threshold controls the cluster size, preventing the creation of overly
small clusters, which are often noisy and unreliable: clusters are never accepted
if they contain fewer than $N_0$ examples. Without this filtering, the algorithm
creates noisy clusters covering very few examples. These clusters then allow
more splits to be accepted. In our experiments, we have found that without
filtering, the algorithm finishes with many very small clusters, which is clearly
not what we want. We have chosen $N_0 = 20$ for all our experiments.

3.4 Criterion DC: With Training Documents Categorization

$$\min \; \bigl| \{ x \mid \exists\, \Pi_1, \Pi_2 \text{ s.t. } c(\Pi_1) = c(\Pi_2) = c(x), \; \Pi_1, \Pi_2 \text{ nearest clusters of } x \} \bigr| \qquad (6)$$



This formula expresses the fact that we want to minimize the number of examples
for which the two nearest prototypes are in the same class. The idea is that
if this is true for many examples, then we do not need to distinguish those
two prototypes. As before, this formula is obviously minimal with one cluster
per class. Whenever a second cluster is created for any class, there is almost
always at least one example which does not satisfy the criterion anymore. The
constraint (6) has been relaxed as follows: new clusters are accepted if fewer than
$m \cdot \min(|p_1|, |p_2|)$ examples of the two clusters do not satisfy the constraint
anymore, with m a free parameter. Experiments show that results are not very
sensitive to the value of m (we have tried from m = 0.5 to m = 5 with no
significant differences).

Finally, we also need to point out that this constraint only takes into account
correctly classified examples, which implicitly lowers the influence of noisy
examples. This explains why there is no need to filter small clusters, filtering being
already implicitly performed by the criterion.
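
The relaxed DC test can be sketched as follows. The sketch assumes dot-product similarity over dense vectors and a list of (class, prototype) pairs that already includes the two new prototypes; the free parameter of the relaxed constraint is called `tolerance` here, and both function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dc_violations(examples, label, prototypes, similarity=dot):
    """Count the examples whose two nearest prototypes both carry the
    examples' own class label."""
    count = 0
    for x in examples:
        ranked = sorted(prototypes, key=lambda lp: similarity(lp[1], x), reverse=True)
        if len(ranked) >= 2 and ranked[0][0] == label and ranked[1][0] == label:
            count += 1
    return count


def dc_accepts(p1, p2, label, prototypes, tolerance=1.0, similarity=dot):
    """Accept the split if fewer than tolerance * min(|p1|, |p2|) examples
    of the two new clusters violate the constraint."""
    violations = dc_violations(p1 + p2, label, prototypes, similarity)
    return violations < tolerance * min(len(p1), len(p2))


# Case 1: a "square" prototype lies between the two "round" prototypes.
mixed = [("round", (1.0, 0.0)), ("round", (0.0, 1.0)), ("square", (0.7071, 0.7071))]
useful_split = dc_accepts(
    [(1.0, 0.0), (0.995, 0.1)], [(0.0, 1.0), (0.1, 0.995)], "round", mixed
)

# Case 2: the two "round" prototypes are neighbours and "square" is far away.
unmixed = [("round", (1.0, 0.0)), ("round", (0.8, 0.6)), ("square", (-1.0, 0.0))]
useless_split = dc_accepts(
    [(1.0, 0.0), (0.995, 0.1)], [(0.8, 0.6), (0.6, 0.8)], "round", unmixed
)
```

In the first case no example has two same-class nearest prototypes, so the split is accepted; in the second case all four examples do, so the split is rejected.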



4 Experiments and Results 

4.1 Corpora 

We have used three different corpora: Newsgroups, Spam and Mail Center. 

The Newsgroups corpus, collected by Ken Lang, contains 20000 messages,
evenly distributed in 20 classes, each class corresponding to a different Usenet
newsgroup¹. A list of newsgroups is given in Table 1. Two new corpora were
then derived: the first one is obtained by merging newsgroups with the same
Usenet prefix², leading to a partition of messages into 4 homogeneous classes;
the second one results from the merging of unrelated newsgroups into four
super-classes, so as to have intermixed subclusters. This corpus is later referred to as the
heterogeneous corpus.

The Spam corpus contains 2193 emails. The task here is to discriminate junk
or unsolicited emails from legitimate ones. The corpus contains 1460 spam messages
and 733 legitimate emails. Messages in English and in French are mixed in the
two classes.

The Mail Center corpus contains 2393 emails received by a customer service 
classified in 40 categories (see [18] for more information regarding this corpus). 
Messages are mostly written in French. 

In all our experiments, two thirds of the corpus are used for training and the
remaining part is used for testing.

¹ The corpus can be downloaded at
http://www-2.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html
² We have placed soc.religion.christian, alt.atheism and misc.forsale in the most
similar classes so as to have super-classes with homogeneous sizes.

Table 1. List of groups for the newsgroups corpus.

 1. comp.graphics                 11. alt.atheism
 2. comp.windows.x                12. sci.electronics
 3. comp.os.ms-windows.misc       13. sci.crypt
 4. comp.sys.mac.hardware         14. sci.space
 5. comp.sys.ibm.pc.hardware      15. sci.med
 6. talk.politics.guns            16. misc.forsale
 7. talk.politics.mideast         17. rec.sport.baseball
 8. talk.politics.misc            18. rec.sport.hockey
 9. talk.religion.misc            19. rec.autos
10. soc.religion.christian        20. rec.motorcycles

Base 1: non mixed subclusters
  C1    1,2,3,4,5
  C2    6,7,8,9,10,11
  C3    12,13,14,15
  C4    16,17,18,19,20

Base 2: Intermixed subclusters
  C1    1,2,6,12,16
  C2    3,7,8,13,17
  C3    4,9,10,14,18
  C4    5,11,15,19,20


Table 2. Performance measure: Acc-1/Acc-2. Bold values show the best algorithm for 
each corpus. There is no Acc-2 measure for the Spam corpus since it contains only two 
classes. 

                                   Newsgroups 
                    normal        heterogeneous  homogeneous   Spam      Mail Center 
Rocchio             0.810/0.921   0.754/0.921    0.892/0.973   0.771/.   0.508/0.724 
K-NN                0.844/0.944   0.864/0.970    0.932/0.983   0.930/.   0.535/0.649 
SVM                 0.865/0.932   0.890/0.968    0.959/0.990   0.974/.   0.578/0.713 

clustered-SVM       -             0.878/0.960    0.940/0.984   0.978/.   0.532/0.698 
K-Means             0.810/0.921   0.815/0.944    0.904/0.973   0.959/.   0.527/0.726 
Greedy clustering   0.809/0.921   0.804/0.943    0.908/0.975   0.948/.   0.560/0.729 
Criterion RP        0.813/0.924   0.818/0.942    0.905/0.972   0.964/.   0.522/0.723 
Criterion DC        0.818/0.922   0.813/0.932    0.907/0.973   0.962/.   0.562/0.731 

(Only four values survive for the normal column of the lower block; aligning them 
with the four MPR rows is an assumption.) 



4.2 Accuracy Measures 



Experiments have been performed for all corpora with the following algorithms: 
K nearest neighbors (k-NN), Support Vector Machine (SVM), simple Rocchio, 
Clustered-SVM, and MPR with K-Means, with criterion-based clustering, and with a 
greedy divisive clustering. The greedy clustering always splits the cluster with 
the smallest norm. Clustered-SVM is similar to the work of [11]: after 
clustering, an SVM is learned for each cluster and new documents are assigned to 
the class of the best cluster. For algorithms where the number of clusters must 
be provided (such as K-Means), we have used the number found by criterion-based 
clustering. To compare algorithms, we have used a more general performance 
measure than the usual accuracy. Acc-n is defined as the percentage of test 
examples for which the correct class is found among the n best answers of the 
algorithm. This measure seems to provide a reasonable estimate of actual 
performance in application contexts involving human validation [18]. We have 
shown in [19] that Rocchio performs comparatively much better with Acc-n (n > 1) 
than with Acc-1. 

Results presented in Table 2 suggest the following comments: 

— SVM consistently outperforms all other algorithms for the Acc-1 measure. 

— Clustered-SVM is always lower than simple SVM, as in [11]. This confirms 
that clustering is useful for Rocchio but not necessarily for other classifiers. 

— For the three newsgroups corpora, MPR is only slightly better than Rocchio 
and always worse than k-NN or SVM. 

— For the Spam and Mail Center corpora, the improvement over Rocchio is 
substantial, and MPR surpasses k-NN and even SVM for Mail Center with Acc-2. 

— MPR improves Rocchio more for Acc-1 than for Acc-2. 

— The different clustering algorithms have a very similar behaviour. The main 
advantage of our criterion-based clustering is its ability to automatically 
determine the “right” number of clusters. Overall, the DC criterion is more 
stable and accurate and is used for the rest of our experiments. 
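
The Acc-n measure defined at the start of this section can be sketched in a few lines. This is only an illustration: the function name `acc_n` and the toy rankings are assumptions, and each classifier is assumed to return its candidate classes ordered from best to worst.

```python
def acc_n(ranked_predictions, true_labels, n):
    """Fraction of test examples whose true class is among the n best answers."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("one ranked list is needed per test example")
    hits = sum(
        1
        for ranking, truth in zip(ranked_predictions, true_labels)
        if truth in ranking[:n]
    )
    return hits / len(true_labels)


# Hypothetical rankings for four test documents (best answer first).
rankings = [
    ["sci", "rec", "comp"],
    ["rec", "sci", "comp"],
    ["comp", "talk", "sci"],
    ["talk", "comp", "rec"],
]
truth = ["sci", "sci", "talk", "rec"]

print(acc_n(rankings, truth, 1))  # Acc-1: 0.25
print(acc_n(rankings, truth, 2))  # Acc-2: 0.75
```

Acc-1 coincides with the usual accuracy, and Acc-n can only grow with n, which is why a classifier with good second guesses gains more from Acc-2.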



4.3 Discussions about Number of Clusters Found by MPR 

Both implementations of MPR are also able to create useful clusters to improve 
overall accuracy. Unlike K-Means or any unsupervised clustering, they are able 
to choose the right number of clusters. 

This fact is confirmed by an examination of the results on the Newsgroups 
homogeneous and heterogeneous corpora. For these experiments, the right number 
of clusters is known in advance: for the homogeneous corpus we expect clustering 
to be useless, whereas for the heterogeneous corpus we expect to get better 
results by recovering the 20 initial classes. Experimental results are reported 
in Table 3. MPR is clearly able to see a difference between these two simulated 
corpora and to pick an approximately correct number of clusters. 



Table 3. Number of clusters found by each algorithm. 

               homogeneous    heterogeneous 
Optimal        4 or 20        20 
Criterion RP   6.2 (0.905)    22 (0.818) 
Criterion DC   5.4 (0.905)    14 (0.813) 



A second way to check that the number of clusters is correct is to do an 
exhaustive search. To this end, we have used the greedy clustering with an 
increasing number of clusters. Each of these clusterings is then integrated in 
the classifier. Results are presented in Figure 3. Our criterion seems to always 
find slightly fewer clusters than the optimum value, the resulting accuracy being 
slightly lower than the maximum found by greedy clustering. Overall, we think 
that the criterion-based clustering is able to find an appropriate number of 
clusters. 

Fig. 3. Accuracy with increasing numbers of clusters. The dots represent number of 

Fig. 4. Accuracy according to the number of learning examples. 



4.4 Influence of Corpus Size 

Rocchio is very efficient and often outperforms other algorithms when the 
training corpus contains very few examples per class. The clustering algorithms 
presented here require a large number of examples to be able to create 
statistically coherent clusters. To see how the combination of clustering and 
Rocchio behaves on a small corpus, we have performed additional experiments with 
corpora of varying size. 

Results are reported in Figure 4. As expected, Rocchio outperforms all other 
classifiers with very few documents, but its learning curve rapidly flattens. 
MPR is a compromise between Rocchio and k-NN / SVM, with better accuracy than 
k-NN or SVM with few examples and better accuracy than Rocchio with many 
examples. 

5 Conclusion and Future Work 

We have presented here a weakly supervised clustering algorithm and 
demonstrated the usefulness of this learning procedure when used in conjunction 
with a Rocchio classifier for text categorization tasks. Unlike related work, 
this algorithm uses the supervision data (class labels) to guide clustering. We 
think this is the main reason for the difference in results between [11] and our 
work. This strategy provides the clustering algorithm with a well-behaved 
stopping criterion, which makes it possible to discover the right number of 
clusters automatically. The criterion based on performance measures seems to be 
the most accurate and most stable one. Using this weakly supervised clustering, 
we have successfully managed to improve Rocchio’s performance, with error rates 
dropping by between 4 % and 84 % depending on the corpus. These differences in 
performance confirm that Rocchio can use some extra information regarding the 
internal organization of clusters, but that this information is not useful for 
all corpora. In our experiments, we found that the number of useful subclusters 
is relatively small, thus preserving the efficiency of Rocchio during the 
classification phase. We have also identified an important characteristic of our 
clustering algorithm: it requires more documents than a simple Rocchio 
classifier; in fact, using it with too small a corpus can even lower accuracy. 
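
As a rough check of the quoted range, the relative error reduction can be recomputed from the Acc-1 values of Table 2. This sketch assumes that the 4 % and 84 % figures are relative reductions of the Acc-1 error rate with respect to Rocchio, and that 0.818 (normal Newsgroups) and 0.964 (Spam, criterion RP) are the MPR values behind them; both points are assumptions, not statements from the text.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop of the error rate (1 - accuracy) between two classifiers."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Acc-1 values read from Table 2 (Rocchio first, MPR second).
newsgroups_normal = relative_error_reduction(0.810, 0.818)
spam = relative_error_reduction(0.771, 0.964)

print(f"normal Newsgroups: {100 * newsgroups_normal:.1f} %")  # about 4.2 %
print(f"Spam:              {100 * spam:.1f} %")               # about 84.3 %
```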

Our clustering algorithm makes it possible to discover some hidden structure in 
any textual database in a way that is beneficial for the Rocchio classifier. We 
plan to investigate the use of similar unsupervised clustering techniques for 
other tasks, including forgetting past examples after a concept shift and 
managing a temporal stream of documents by monitoring these clusters. 

Abstract. In traditional multi-instance (MI) learning, a single positive 
instance in a bag produces a positive class label. Hence, the learner knows 
how the bag’s class label depends on the labels of the instances in the bag 
and can explicitly use this information to solve the learning task. In this 
paper we investigate a generalized view of the MI problem where this 
simple assumption no longer holds. We assume that an “interaction” 
between instances in a bag determines the class label. Our two-level 
learning method for this type of problem transforms an MI bag into 
a single meta-instance that can be learned by a standard propositional 
method. The meta-instance indicates which regions in the instance space 
are covered by instances of the bag. Results on both artificial and real- 
world data show that this two-level classification approach is well suited 
for generalized MI problems. 



1 Introduction 

Multi-instance (MI) learning has received a significant amount of attention over 
the last few years. In MI learning, each training example is a bag of instances 
with a single class label, and one assumes that the instances in a bag have 
individual, but unknown, class labels. The bag’s class label depends in some 
way on the unknown classifications of its instances. The assumption of how the 
instances’ classifications determine their bag’s class label is called the multi- 
instance assumption. In existing approaches to MI learning, a bag is assumed 
to be positive if and only if it contains at least one positive instance. This 
assumption was introduced because it seemed to be adequate in the MI datasets 
used so far [1]. We refer to it as the standard MI assumption. 

In this paper we explore a generalization of the standard assumption by 
extending the process that combines labels of instances to form a bag label. In the 
standard MI case, one instance that is positive w.r.t. an underlying propositional 
concept makes a bag positive. Instead of a single underlying concept, we use a 
set of underlying concepts and require a positive bag to have a certain number 
of instances in each of them. 



N. Lavrač et al. (Eds.): ECML 2003, LNAI 2837, pp. 468-479, 2003. 
© Springer-Verlag Berlin Heidelberg 2003 



A Two-Level Learning Method for Generalized Multi-instance Problems 



We introduce three different generalized MI concepts. In presence-based MI 
datasets, a bag is labeled positive if it contains at least one instance in each of the 
underlying concepts; a threshold-based MI dataset requires a concept-dependent 
minimum number of instances of each concept; and in a count-based MI dataset, 
the number of instances per concept is bounded by an upper as well as a lower 
limit. These three types of MI concepts form a hierarchy, i.e. presence-based 
$\subset$ threshold-based $\subset$ count-based. Note that the standard MI problem 
is a special case of a presence-based MI concept with just one underlying 
concept. Consequently, any learner able to solve our generalized MI problem 
should perform well on standard MI data. 

In generalized MI problems, the learner has much less prior knowledge about 
the way the class label is determined by the instances in a bag, making this type 
of problem more difficult. We introduce the idea of two-level classification (TLC) 
to tackle generalized MI problems. In the first step, this method constructs a 
single instance from a bag. This so-called meta-instance represents regions in the 
instance space and has an attribute for each of these regions. Each attribute 
indicates the number of instances in the bag that can be found in the 
corresponding region. Together with the bag’s class label, the meta-instance can 
be passed to a standard propositional learner in order to learn the influence of 
the regions on a bag’s classification. 

This paper is structured as follows. Section 2 gives an overview of the 
standard multi-instance problem and introduces notational conventions. In 
Section 3, we give definitions for our three generalizations of the MI problem. 
The two-level classification method is outlined in Section 4, and experiments 
with the algorithm on artificial data and the Musk problems are described in 
Section 5. We summarize our findings in Section 6. 

2 The Standard Multi-instance Setting 

In traditional supervised learning, each learning example consists of a fixed num- 
ber of attributes and a class label. However, sometimes only a collection of in- 
stances can be labeled. For these cases, Dietterich et al. [1] introduced the notion 
of a multi-instance problem, where a “bag” of instances is given a class label. 
The motivating task was to predict whether a certain molecule is active or not. 
This is determined by its chemical binding properties, which again depend on 
the shape of the molecule. A molecule occurs in different shapes (conformations), 
because some of its internal bonds can be rotated. If at least one of the confor- 
mations of the molecule binds well to certain receptors, the molecule expresses a 
“musky” smell and is therefore considered active. In the Musk problems, a bag 
corresponds to a molecule, and the instances are its conformations. 

In this paper, we follow the notation of Gärtner et al. [2]. $\mathcal{X}$ 
denotes the instance space, $\Omega$ is the set of class labels. In MI learning, 
the class is assumed to be binary, so $\Omega = \{\top, \bot\}$. An MI concept is 
a function $\nu_{MI} : 2^{\mathcal{X}} \to \Omega$. In the standard MI case, this 
function is defined as 

$$\nu_{MI}(X) \Leftrightarrow \exists x \in X : c_I(x)$$

where $c_I \in \mathcal{C}$ is a concept from a concept space $\mathcal{C}$ 
(usually called the “underlying concept”), and $X \subseteq \mathcal{X}$ is a 
set¹ of instances. In this type of problem, a learner can be sure that every 
instance that is encountered in a negative bag is also a negative instance w.r.t. 
the underlying concept. Thus, it can focus on identifying positive instances 
using axis-parallel rectangles [1], neural networks [3], or the Diverse Density 
algorithm [4]. However, in this paper we are interested in a more generalized 
problem that leads to a harder learning task. 
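
The standard MI assumption can be written as a one-line predicate. The sketch below is illustrative only: `standard_mi_label` and the toy concept `binds_well` are hypothetical names, and instances are reduced to a single number for brevity.

```python
from typing import Callable, Iterable, TypeVar

Instance = TypeVar("Instance")


def standard_mi_label(bag: Iterable[Instance],
                      concept: Callable[[Instance], bool]) -> bool:
    """A bag is positive iff at least one of its instances satisfies the concept."""
    return any(concept(x) for x in bag)


def binds_well(conformation: float) -> bool:
    """Hypothetical underlying concept: a conformation scoring above 0.8."""
    return conformation > 0.8


print(standard_mi_label([0.1, 0.9, 0.3], binds_well))  # True
print(standard_mi_label([0.1, 0.2], binds_well))       # False
```

The second call also shows why negative bags are informative in this setting: every instance in such a bag is known to be negative with respect to the underlying concept.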

3 Generalized Multi-instance Problems 

We extend the assumption of how a bag’s label is determined by the 
classifications of its instances. In the standard MI case, a single instance that 
is positive in the underlying concept causes the bag label to be positive. In our 
case, we no longer assume a single concept. Instead, a set of underlying 
concepts is used, each of which contributes to the classification. We assume that 
criteria based on the number of instances in each concept determine the bag’s 
class label. Below we introduce generalizations of the standard MI concept that 
are based on three different types of criteria. 

These more general problems cannot be solved simply by identifying positive 
instances that never occur in negative bags, because an instance in a negative 
bag can still be positive w.r.t. one of the underlying concepts. In other words, 
a positive instance does not necessarily cause a bag to be positive. The bag’s 
label is positive only if a sufficient number of instances of the other concepts 
is also present in the bag. Thus a learning method for this task must take all 
instances in the bag into account. 

To formalize our generalized view on MI learning, we redefine the MI concept 
function $\nu_{MI}$ introduced in the previous section. The data representation 
is unchanged, which means that we are given bags $X \subseteq \mathcal{X}$ with 
class labels in $\{\top, \bot\}$. The generalized MI concept function operates on 
a set of underlying concepts $C \subset \mathcal{C}$. We also need a counting 
function $\Delta : 2^{\mathcal{X}} \times \mathcal{C} \to \mathbb{N}$, which 
counts the members of a given concept in a bag. 

3.1 Presence-Based MI Concepts 

Our first generalization is defined in terms of the presence of instances of each 
concept in a bag. For example, an MI concept of this category is “only if instances 
of concept $c_1$ and instances of concept $c_2$ are present in the bag, the class 
is positive”. Formally, a presence-based MI concept is a function 
$\nu_{PB} : 2^{\mathcal{X}} \to \Omega$, where for a given set of concepts 
$C \subset \mathcal{C}$, 

$$\nu_{PB}(X) \Leftrightarrow \forall c \in C : \Delta(X, c) \ge 1$$

The following example (introduced in [5]) illustrates a presence-based MI 
concept. Assume we are given a set of bunches of keys. A bunch corresponds to a 
bag, the keys are its instances. The task is to predict if an unseen bunch can be 
used to open a locked door. If this door has only one lock, every bunch with at 
least one key for that lock would be positive. This corresponds to a standard MI 
learning problem. However, assume there are n different locks that need to be 
unlocked before the door can be opened. This is an example of a presence-based 
MI concept, because we need at least one instance (one key) of each of the n 
concepts (the locks) to classify a bag as positive. Thus the standard MI problem 
is a special case of this presence-based concept with $|C| = 1$. 

¹ For notational convenience, we assume that all the instances in a bag are distinct. 

Fig. 1. A multi-instance dataset with a presence-based MI concept. Instances in a 
two-dimensional instance space (attribute att0 and attribute att1) are assigned to 
five bags (♦, •, ○, □ and △). Two concepts (concept0 and concept1) are given as 
circles. The instances inside a circle are part of the corresponding concept. ♦ 
and • are positive bags, because they have instances in both concept0 and concept1 

Figure 1 visualizes a presence-based MI concept in a two-dimensional 
instance space. Note that presence-based MI problems have been introduced before 
as “multi-tuple problems” in an ILP-based setting [6]. The instances are all part 
of the same instance space, thus the number of underlying database relations 
is 1. In the ILP definition of standard MI learning, a hypothesis consists of a 
single rule using only one tuple variable. This rule corresponds to what we call 
the underlying concept. Multi-tuple problems are the relaxation of this 
definition, where an arbitrary number of rules can be used. These correspond to 
our set of concepts. Thus, because presence-based MI problems can be embedded 
into the ILP framework, they can, at least in principle, be solved by ILP 
learners. Another generalization of the MI problem called “multi-part problem” 
has been introduced in an ILP-based setting [5]. In this type of problem no 
explicit assumption is made about how the instances in a bag contribute to the 
bag’s classification. 
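
The key-and-lock example of a presence-based concept can be sketched as follows. The helper names (`count_members`, `presence_based_label`, `opens`) and the representation of a key by the name of the lock it fits are assumptions made for the illustration.

```python
def count_members(bag, concept):
    """Counting function: number of instances in the bag that belong to the concept."""
    return sum(1 for instance in bag if concept(instance))


def presence_based_label(bag, concepts):
    """Positive iff every underlying concept has at least one member in the bag."""
    return all(count_members(bag, concept) >= 1 for concept in concepts)


def opens(lock):
    """Hypothetical concept: the keys that fit the given lock."""
    return lambda key: key == lock


# A door with two different locks; a bunch of keys is a bag.
door = [opens("front"), opens("cellar")]

print(presence_based_label(["front", "cellar", "bike"], door))  # True
print(presence_based_label(["front", "bike"], door))            # False

# With a single lock the concept reduces to the standard MI assumption.
print(presence_based_label(["front", "bike"], [opens("front")]))  # True
```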

3.2 Threshold-Based MI Concepts 

Instead of the mere presence of certain concepts in a bag, one can require a 
certain number of instances of each concept to be present simultaneously. If for 
each concept, the corresponding number of instances of that concept reaches a 
given threshold (which can be different for each concept), the bag is labeled 
positive, and negative otherwise. Formally, we define a threshold-based MI 
concept function $\nu_{TB}$ as 

$$\nu_{TB}(X) \Leftrightarrow \forall c_i \in C : \Delta(X, c_i) \ge t_i$$

where $t_i \in \mathbb{N}$ is called the “lower threshold for concept $i$”. An 
extension of the “key-and-lock” example mentioned above illustrates a 
threshold-based MI concept. Assume the door has more than just one lock of each 
type, and that the keys have to be used simultaneously in order to open the door, 
i.e. we need one key for each lock. Here, we need at least as many keys for each 
type of lock as there are locks of this type in the door. In such a dataset, each 
positive bag (a bunch of keys) has to have a minimum number of keys (the 
instances) of each type of lock (the concepts). 
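
A minimal sketch of the threshold-based concept for the extended key-and-lock example is given below. The function and variable names are hypothetical, and a key is again represented by the type of lock it fits.

```python
def count_members(bag, concept):
    """Counting function: how many instances of the bag belong to the concept."""
    return sum(1 for instance in bag if concept(instance))


def threshold_based_label(bag, concepts, lower_thresholds):
    """Positive iff every concept c_i has at least t_i member instances in the bag."""
    return all(
        count_members(bag, concept) >= t
        for concept, t in zip(concepts, lower_thresholds)
    )


def opens(lock_type):
    """Hypothetical concept: the keys that fit locks of the given type."""
    return lambda key: key == lock_type


# A door with two locks of type "A" and one lock of type "B".
concepts = [opens("A"), opens("B")]
thresholds = [2, 1]

print(threshold_based_label(["A", "A", "B", "C"], concepts, thresholds))  # True
print(threshold_based_label(["A", "B", "B"], concepts, thresholds))       # False
```

Setting every threshold to 1 recovers the presence-based concept.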



3.3 Count-Based MI Concepts 

The most general concepts in our hierarchy are count-based MI concepts. These 
require a maximum as well as a minimum number of instances of a certain 
concept in a bag. This can be formalized as 

$$\nu_{CB}(X) \Leftrightarrow \forall c_i \in C : t_i \le \Delta(X, c_i) \le z_i$$

where $t_i \in \mathbb{N}$ is again the lower threshold and $z_i \in \mathbb{N}$ 
is called the “upper threshold for concept $i$”. Imagine the following learning 
example for this type of problem. We are given daily statistics of the orders 
processed by a company. An order is usually assigned to one of the company’s 
departments. We want to predict if the company’s workload is within an optimal 
range, where none of the departments processes too few orders (because its 
efficiency would be low), or gets too many orders (because it would be 
overloaded). In an MI representation of this problem, bags are collections of 
orders (the instances) of a certain day, and the class label indicates whether 
the company was within an effective workload on that day. Each of the underlying 
concepts in $C$ assigns an order to a department. In order for the company to 
perform efficiently on a certain day, each of the departments has to work within 
its optimal range, i.e. the number of instances of this concept must be bounded 
from below and above. These bounds can be different for each concept. 
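
The count-based concept, and the way it subsumes the two other types, can be sketched as follows. The department names and bounds are invented for the illustration, and an order is represented only by the department it is assigned to.

```python
import math


def count_based_label(bag, concepts, bounds):
    """Positive iff t_i <= (number of members of c_i) <= z_i for every concept."""
    for concept, (lower, upper) in zip(concepts, bounds):
        n = sum(1 for instance in bag if concept(instance))
        if not lower <= n <= upper:
            return False
    return True


def handled_by(department):
    """Hypothetical concept: the orders assigned to the given department."""
    return lambda order: order == department


concepts = [handled_by("sales"), handled_by("support")]
bounds = [(2, 4), (1, 3)]  # (t_i, z_i) per department

good_day = ["sales", "sales", "support", "sales"]
overloaded_day = ["sales"] * 5 + ["support"]
idle_day = ["sales", "sales"]

print(count_based_label(good_day, concepts, bounds))        # True
print(count_based_label(overloaded_day, concepts, bounds))  # False
print(count_based_label(idle_day, concepts, bounds))        # False

# An infinite upper bound gives a threshold-based concept; lower bounds of 1
# with infinite upper bounds give a presence-based concept.
presence_bounds = [(1, math.inf)] * len(concepts)
print(count_based_label(overloaded_day, concepts, presence_bounds))  # True
```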



4 Two-Level Classification 

Although ILP-based learners can be used for presence-based MI problems, there 
appears to be no method capable of dealing with threshold- and count-based 
MI concepts. In particular, learners relying on the standard MI assumption are 
doomed to fail, because they aim at identifying positive instances that are not 
present in a negative bag. In our generalized view, this is usually not the case. 

Taking a closer look at the way MI data are created, there are two functions 
that determine a bag’s class label. First, there is a function that assigns an 
instance in a bag to a concept, a mono-instance concept function. Second, there 
is the MI concept function that computes a class label from the instances in 
a bag, given their concept membership by the first function. Thus, a two-level 
approach to learning seems appropriate. At the first level, we try to learn the 
structure of the instance space $\mathcal{X}$, and at the second level we try to 
discover the interaction that leads to a bag’s class label. Let us discuss the 
second level first. 



4.1 Second Level: Exploring Interactions 

Assume we are given the concept membership functions $c_i$ for all concepts 
$c_i \in C$. Then we can transform a bag $X$ into a meta-instance with $|C|$ 
numerical attributes whose values indicate the number of instances in the bag 
that are members of the respective concept. Assigning the bag’s class label to 
this newly created instance, we end up with a single instance appropriate for 
propositional learning. Processing all the MI bags in this way results in a 
propositional dataset that can be used as input for a propositional learner. 
However, we are not given the concept membership functions, and inducing each 
$c_i$ directly is not possible, because we know neither the number of concepts 
that are used nor the instances’ individual class labels. Instead, we are only 
given a class label for the bag as a whole. However, it turns out that a 
decomposition of the instance space into “candidate” regions for each concept is 
possible, enabling the learner to recover the true MI concept at the second 
level. This is described in the next section. 
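
If the concept membership functions were known, the transformation of bags into meta-instances described above could be sketched as below. The one-dimensional instance space and the two interval concepts are invented for the illustration.

```python
def to_meta_instance(bag, concepts):
    """One numerical attribute per concept: how many bag instances are members."""
    return [sum(1 for x in bag if concept(x)) for concept in concepts]


def to_propositional_dataset(bags, labels, concepts):
    """Pair every meta-instance with the class label of its bag."""
    return [(to_meta_instance(bag, concepts), label)
            for bag, label in zip(bags, labels)]


# Hypothetical 1-D instance space with two known interval concepts.
concepts = [lambda x: 0.0 <= x < 1.0, lambda x: 2.0 <= x < 3.0]
bags = [[0.5, 2.5, 2.7, 5.0], [0.1, 0.2, 4.0]]
labels = [True, False]

dataset = to_propositional_dataset(bags, labels, concepts)
print(dataset)  # [([1, 2], True), ([2, 0], False)]
```

Any propositional learner that can find intervals on these count attributes can then recover presence-, threshold-, or count-based conditions.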



4.2 First Level: Structuring the Instance Space 

In the first level of our classification procedure, we construct a single instance 
from an MI bag. The attributes in this instance represent regions of the instance 
space, and an attribute’s value is simply the number of instances in the bag that 
pertain to the corresponding region. One possible approach for identifying the 
regions would be clustering. However, this discards information given by the bag 
label. If we label each instance with its bag’s label, regions with a high observed 
proportion of positive instances will be candidate components for concepts (see 
Figure ??). Hence, the “clustering” method should be sensitive to changes in the 
class distribution. 

Consequently, we use a standard decision tree for imposing a structure on 
the instance space because it is able to detect these changes. The decision tree 
is built on the set of all instances contained in all bags, labeled with their bag’s 
class label. The weight of each instance in a bag $X$ is set to 
$\frac{1}{|X|} \cdot \frac{N}{b}$, where $N$ denotes the sum of all the bag sizes 
and $b$ the number of bags in the dataset. This gives bags of different size the 
same weight and makes the total weight equal to the number of instances. 
Information gain is used as the test selection measure. A node in the tree is not 
split further when the weight of its instances sums up to less than 2, and no 
other form of pruning is used. A unique identifier is assigned to each node of 
the tree. Using this tree, we convert a bag into a single instance with one 
numerical attribute for each node in the tree. Each attribute counts how many 
instances in the bag are assigned to the corresponding node in the tree. See 
Figure 2 for an illustration. 

Fig. 2. Constructing a single instance from a bag with three instances and three 
attributes a1, a2 and a3. A decision tree with five nodes is used to identify 
regions in the instance space. The constructed single instance (right) has an 
attribute for each node in the tree, which counts the number of instances in the 
bag pertaining to that node 
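
The instance weighting and the conversion of a bag into node counts can be sketched as follows. This is not the authors' implementation: the tree is built by hand instead of being induced with information gain, the `Node` class and all thresholds are hypothetical, the weight is taken to be N / (b |X|) as implied by the two properties stated above, and an instance is assumed to be counted at every node on its path from the root to a leaf.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    node_id: int
    attribute: Optional[int] = None  # index of the tested attribute, None for a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold


def instance_weight(bag_size: int, total_instances: int, n_bags: int) -> float:
    """Weight N / (b * |X|): every bag gets total weight N / b."""
    return total_instances / (n_bags * bag_size)


def bag_to_node_counts(bag: Sequence[Sequence[float]], root: Node,
                       n_nodes: int) -> List[int]:
    """Count, for every node identifier, the bag instances that pass through it."""
    counts = [0] * n_nodes
    for x in bag:
        node = root
        while node is not None:
            counts[node.node_id] += 1
            if node.attribute is None:  # leaf reached
                break
            node = node.left if x[node.attribute] <= node.threshold else node.right
    return counts


# Hand-built tree with five nodes over three attributes (a1, a2, a3).
root = Node(0, attribute=0, threshold=0.5,
            left=Node(1),
            right=Node(2, attribute=1, threshold=0.5,
                       left=Node(3), right=Node(4)))

bag = [(0.2, 0.9, 0.1), (0.7, 0.3, 0.5), (0.9, 0.8, 0.4)]
print(bag_to_node_counts(bag, root, 5))  # [3, 1, 2, 1, 1]

# Two bags of sizes 2 and 6: both receive the same total weight, and the
# weights sum to the number of instances.
sizes = [2, 6]
weights = [instance_weight(s, sum(sizes), len(sizes)) for s in sizes]
print([w * s for w, s in zip(weights, sizes)])
```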

4.3 Attribute Selection 

Attribute selection (AS) can be applied to refine the classifier. If the decision 
tree is not able to find the region representing a concept, classification at the 
meta-level will fail. Attributes that do not contribute to the classification of 
individual instances could be picked as splitting attributes in the tree and thus 
cause an incorrect representation of the concept regions. Attribute selection tries 
to eliminate these attributes. In our experiments, we used the method proposed 
in [7], which evaluates a given subset of attributes by cross-validations on this 
subset. Note that the cross-validation is performed at the bag level using both 
levels of TLC. We used backward selection, which starts with all attributes and 
subsequently eliminates attributes that worsen the performance. Of course, this 
method is computationally expensive and increases the runtime of the two-level 
classifier considerably. 
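
A greedy backward selection loop of this kind can be sketched as below. The sketch is generic: `evaluate` stands for any scoring function of an attribute subset (in the setting above it would be a bag-level cross-validation of both TLC levels), and the toy scoring function is invented for the illustration. It is not claimed to reproduce the exact procedure of [7].

```python
def backward_selection(attributes, evaluate):
    """Greedy backward elimination.

    Starts from all attributes and repeatedly drops one attribute whose
    removal does not lower the score returned by `evaluate(subset)`.
    """
    selected = list(attributes)
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for attribute in list(selected):
            candidate = [a for a in selected if a != attribute]
            score = evaluate(candidate)
            if score >= best:  # the attribute was not helping
                selected, best, improved = candidate, score, True
                break
    return selected, best


# Toy stand-in for cross-validated accuracy: attributes 0 and 1 are relevant,
# every other attribute costs a small penalty.
def toy_score(subset):
    relevant = sum(1 for a in subset if a in (0, 1))
    irrelevant = len(subset) - relevant
    return relevant - 0.1 * irrelevant


selected, score = backward_selection([0, 1, 2, 3], toy_score)
print(selected, score)  # [0, 1] 2.0
```

Each pass calls `evaluate` once per remaining attribute, which makes the cost of wrapping a full cross-validation visible.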

5 Experiments 

We have evaluated the performance of TLC using both artificial and real-world 
data. The only publicly available real-world data stems from the Musk 
problem [1], and for this problem it is very likely that methods able to deal 
with the standard MI assumption are sufficient. We therefore focus on 
artificially created datasets, where the performance on different types of MI 
concepts can be shown, and where we know that the various properties of our three 
types of problem hold. 

Assuming that the first-level classifier can identify concepts over the instance 
space properly, the second-level learner must be able to learn intervals in the 
meta-attributes that are responsible for a positive classification. In our 
experiments, we used Logit-boosted decision stumps [8] with 10 boosting 
iterations, because they are able to do so, are well known to be accurate 
general-purpose classifiers, and performed well in initial experiments. 

We compare the results of TLC both with and without attribute selection to 
the Diverse Density (DD) algorithm [4] and the MI Support Vector Machine [2]. 
Both have been designed for standard MI problems, although the latter does not 
exploit the MI assumption in any way. In the DD algorithm, we used an initial 
scaling parameter of 1.0. The MI Support Vector Machine was based on an RBF 
kernel, a $\gamma$-parameter of 0.6 and a C-parameter of 1.0. 



5.1 Artificial Datasets 



For the artificial datasets in our experiments we used bitstrings of different 
length $l$ as instances, i.e. $\mathcal{X} = \{0,1\}^l$. The length $l$ of the 
bitstring is the sum of the number of relevant attributes $l_r$ and the number of 
irrelevant attributes $l_i$. Concepts $c_i$ are bitstrings in $\{0,1\}^{l_r}$, 
i.e. an instance is a member of a concept $c_i$ if and only if its first $l_r$ 
attributes match the bit pattern $c_i$. 

The construction of an artificial dataset of a certain type (either presence-, 
threshold-, or count-based) works as follows. We randomly select different 
bitstrings of length $l_r$ as concepts. A positive bag is first filled with 
instances of the different concepts according to the type of data we want to 
create: for presence-based MI concepts, at least one instance of each concept is 
added to the positive bag; for a threshold-based MI concept at least $t_i$ 
instances of each concept $c_i$ are added; and for a count-based MI concept, we 
add at least $t_i$ and at most $z_i$ instances of concept $c_i$. Since we choose 
different bitstrings representing the concepts, instances cannot be members of 
two concepts at the same time. In a second step, a number of random instances 
that are not members of any concept are added to the bag. These are created by 
uniformly drawing instances from the instance space $\{0,1\}^l$ and ensuring that 
they are not members of any of the concepts. These “irrelevant” instances are 
designed to make the learning problem harder and more realistic. 
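
The construction of a positive bag described above can be sketched as follows. All function names are hypothetical; the caller chooses the number of instances per concept according to the desired type of concept (at least 1, at least t_i, or between t_i and z_i), and the sketch does not enforce that all instances in a bag are distinct.

```python
import random


def random_concepts(n_concepts, l_r, rng):
    """Draw n_concepts different bitstrings of length l_r."""
    concepts = set()
    while len(concepts) < n_concepts:
        concepts.add(tuple(rng.randint(0, 1) for _ in range(l_r)))
    return sorted(concepts)


def concept_instance(concept, l_i, rng):
    """Member of a concept: the concept bits followed by l_i random irrelevant bits."""
    return concept + tuple(rng.randint(0, 1) for _ in range(l_i))


def irrelevant_instance(concepts, l_r, l_i, rng):
    """Uniform draw from {0,1}^(l_r + l_i), rejected while it matches any concept."""
    while True:
        x = tuple(rng.randint(0, 1) for _ in range(l_r + l_i))
        if x[:l_r] not in concepts:
            return x


def positive_bag(concepts, l_i, counts, n_random, rng):
    """counts[i] members of concept i plus n_random instances outside all concepts."""
    l_r = len(concepts[0])
    bag = [concept_instance(c, l_i, rng)
           for c, k in zip(concepts, counts) for _ in range(k)]
    bag += [irrelevant_instance(concepts, l_r, l_i, rng) for _ in range(n_random)]
    rng.shuffle(bag)
    return bag


rng = random.Random(0)
concepts = random_concepts(2, l_r=5, rng=rng)
bag = positive_bag(concepts, l_i=5, counts=[1, 3], n_random=10, rng=rng)
print(len(bag))  # 14 instances of length 10
```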

Negative bags are constructed by first creating a positive bag in the way 
described above and then converting it into a negative bag. Since a negative bag 
must not satisfy the used MI concept, we need to negate at least one condition 
imposed on one of the concepts in $C$. For a presence-based concept, we need to 
remove all the instances of at least one concept $c_i$; for a threshold-based 
concept, the number of instances of at least one concept $c_i$ must be less than 
$t_i$; and for a count-based concept, the number of instances of at least one 
concept must be increased or decreased so that it is either less than $t_i$ or 
greater than $z_i$. Every possible subset of $C$ except the empty set can be 
negated to create a negative bag. We choose uniformly from these $2^{|C|} - 1$ 
possibilities. Increasing or decreasing the number of instances of a concept in a 
bag can be done by replacing random instances by instances of the respective 
concept or replacing some of the concept’s instances by random instances. After 
negation, the bag has the same size as before, thus the average bag size of 
positive and negative bags is the same. 

Table 1. Results for presence-based MI data using only one underlying concept 

           DD             MI SVM         TLC without AS   TLC with AS 
1-5-0      100 ± 0        94.35 ± 0.74   100 ± 0          100 ± 0 
1-5-5      100 ± 0        92.26 ± 0.95   100 ± 0          100 ± 0 
1-5-10     100 ± 0        90.74 ± 0.76   100 ± 0          100 ± 0 
1-10-0     99.57 ± 0.59   96.20 ± 0.85   97.46 ± 0.92     97.07 ± 0.3 
1-10-5     99.41 ± 0.54   94.67 ± 1.35   97.57 ± 0.87     96.11 ± 1.57 
1-10-10    99.8 ± 0.44    91.66 ± 2.3    97.85 ± 0.89     94.34 ± 2.66 


Table 2. Results for presence-based MI data using two or three underlying concepts 

           MI SVM         TLC without AS   TLC with AS 
2-5-0      80.96 ± 1.9    100 ± 0          100 ± 0 
2-5-5      81.17 ± 1.79   88.38 ± 11.91    100 ± 0 
2-5-10     79.21 ± 1.66   78.64 ± 13.15    100 ± 0 
2-10-0     84.18 ± 0.52   99.01 ± 1.32     97.56 ± 1.38 
2-10-5     82.0 ± 1.53    85.18 ± 10.07    96.67 ± 1.58 
2-10-10    80.74 ± 0.79   86.63 ± 8.69     94.45 ± 4.54 
3-5-0      82.0 ± 2.13    100 ± 0          100 ± 0 
3-5-5      82.12 ± 0.98   81.93 ± 2.9      99.98 ± 0.04 
3-5-10     81.43 ± 0.96   86.32 ± 6.48     98.49 ± 1.74 
3-10-0     84.39 ± 1.25   95.68 ± 3.78     94.73 ± 2.91 
3-10-5     84.27 ± 1.44   78.07 ± 0.91     87.41 ± 6.24 



Presence-based MI Datasets. We created presence-based datasets with concepts
of length 5 and 10, hence $l_r = 5$ or $l_r = 10$, respectively. Some of our
datasets required two concepts to be present in the positive bags ($|C| = 2$),
in others we used three concepts ($|C| = 3$). To generate a positive bag, the
number of instances of each concept was chosen randomly from $\{1, \ldots, 10\}$.
The number of random instances was selected with equal probability from
$\{10|C|, \ldots, 10|C| + 10\}$. Hence the minimal bag size in this dataset was
$|C| + 10|C|$ and the maximal bag size $20|C| + 10$. We trained the classifiers
on five different training sets with 50 positive and 50 negative bags each.
Tables 1 and 2 show the average accuracy on a test set with 5,000 positive and
5,000 negative bags and the standard deviation of the five runs. The parameters
of the dataset are given as $(|C|) - (l_r) - (l_i)$,
e.g. 3-10-5 has three concepts, 10 relevant and 5 irrelevant attributes.
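
The bag-size bounds quoted above follow directly from the two sampling ranges. The following Python sketch only samples the composition of a positive bag (how many instances per concept, how many random instances); the function names are hypothetical, and the generation of the instances themselves is not shown.

```python
import random

def presence_bag_composition(num_concepts, rng):
    """Sample the composition of one positive presence-based bag.

    Returns (per-concept instance counts, number of random instances).
    """
    # Each concept contributes between 1 and 10 instances.
    concept_counts = [rng.randint(1, 10) for _ in range(num_concepts)]
    # Random instances: uniform on {10|C|, ..., 10|C| + 10}.
    n_random = rng.randint(10 * num_concepts, 10 * num_concepts + 10)
    return concept_counts, n_random

def bag_size_bounds(num_concepts):
    """Smallest and largest possible bag size for |C| concepts."""
    return num_concepts + 10 * num_concepts, 20 * num_concepts + 10

rng = random.Random(0)
counts, n_random = presence_bag_composition(3, rng)
low, high = bag_size_bounds(3)
assert low <= sum(counts) + n_random <= high
print(bag_size_bounds(2))  # (22, 50)
```

For two concepts this gives bags of 22 to 50 instances, and for three concepts 33 to 70.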

The results in Table 1 show that each of the MI learners can deal well with
presence-based MI concepts using only one underlying concept (corresponding
to the standard MI assumption). DD cannot deal with more than one underlying
concept and its performance is not competitive, therefore it is not included in
Table 2. The MI SVM does not make explicit use of the standard MI assumption
and, surprisingly, its similarity measure enables it to perform well on these
presence-based datasets (Table 2). The TLC method discovers the underlying
MI concept perfectly in most cases if no irrelevant attributes are used.
Irrelevant attributes make it hard for the decision tree to represent the true
underlying concepts, which in turn worsens the performance of the classifier
(e.g. for dataset 2-5-10). However, the attribute selection method eliminates
the irrelevant attributes, enabling the classifier to give good results even on
datasets with a high ratio of irrelevant attributes (datasets 2-5-10 and 3-5-5).

Table 3. Results for threshold-based MI data

Dataset    MI SVM         TLC without AS   TLC with AS
42-5-0     84.35 ± 3.07   100 ± 0          100 ± 0
42-5-5     81.54 ± 2.24   95.93 ± 9.1      100 ± 0
42-5-10    81.59 ± 0.4    84.67 ± 14.31    100 ± 0
42-10-0    86.28 ± 1.33   99.35 ± 0.45     97.02 ± 1.65
42-10-5    85.36 ± 0.92   88.65 ± 10.12    96.58 ± 1.91
42-10-10   83.93 ± 0.36   84.59 ± 8.08     95.89 ± 2.05
275-5-0    84.75 ± 1.03   97.2 ± 2.78      97.2 ± 2.78
275-5-5    83.9 ± 1.29    90.62 ± 6.57     97.94 ± 2.83
275-5-10   82.73 ± 0.85   86.42 ± 5.39     93.75 ± 6.63
275-10-0   88.66 ± 1.12   95.44 ± 1.21     97.68 ± 1.57
275-10-5   87.05 ± 0.75   86.92 ± 6.56     90.44 ± 4.63

Threshold-based MI Datasets. We created threshold-based datasets using
$l_r = 5$ or $l_r = 10$ and two or three concepts. We chose thresholds
$t_1 = 4$ and $t_2 = 2$ for six datasets and $t_1 = 2$, $t_2 = 7$ and
$t_3 = 5$ for another five datasets. For positive bags, the number of instances
of concept $c_i$ was chosen randomly from $\{t_i, \ldots, 10\}$. To form a
negative bag, we replaced at least $\Delta(X, c_i) - t_i + 1$ instances of a
concept $c_i$ in a positive bag $X$ by random instances, so that fewer than
$t_i$ instances of $c_i$ remain. The minimal bag size in this dataset is
$\sum_i t_i + 10|C|$, the maximal size is $20|C| + 10$. Table 3 shows the
results. The parameters of the dataset are given as
$(t_1 \ldots t_n) - (l_r) - (l_i)$. For example,
42-10-0 has at least 4 instances of the first concept and 2 instances of the second
concept in a positive bag, with 10 relevant and 0 irrelevant attributes.
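
As a small worked check of the threshold rule, the Python sketch below tests whether a bag's per-concept counts meet the thresholds and computes the smallest number of replacements that makes a bag negative. It assumes that a bag becomes negative as soon as one concept falls below its threshold, which means replacing (count - threshold + 1) of its instances; the function names and the example counts are hypothetical.

```python
def is_positive_threshold(counts, thresholds):
    """A bag is positive if every concept reaches its threshold."""
    return all(c >= t for c, t in zip(counts, thresholds))

def min_replacements(count, threshold):
    """Fewest instances of one concept to replace so that the
    remaining count drops below the threshold."""
    return count - threshold + 1

thresholds = [4, 2]   # thresholds of the "42" datasets
counts = [6, 3]       # hypothetical positive bag
assert is_positive_threshold(counts, thresholds)

k = min_replacements(counts[0], thresholds[0])  # 6 - 4 + 1 = 3
negative_counts = [counts[0] - k, counts[1]]
assert not is_positive_threshold(negative_counts, thresholds)
```

Replacing one instance fewer would leave exactly the threshold number of instances, and the bag would still be positive.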

Even though threshold-based MI concepts are harder to learn than presence- 
based ones, the results for the MI SVM show that it can deal quite well with 
these datasets. However, TLC achieves better results, although the variance of 
the results can be high (datasets 42-5-10 and 275-5-5) if no attribute selection 
is performed. We did not apply attribute selection in conjunction with the MI 
SVM, because Table 3 shows that its performance is not greatly affected by 
irrelevant attributes. 

Count-based MI Datasets. Our count-based MI datasets are based on 5 or 10
relevant attributes. We used the same value for both thresholds $t_i$ and
$z_i$, because we considered this an interesting special case. In the
following, we refer to this value as $z_i$. Hence, the number of instances of
concept $c_i$ is exactly $z_i$ in a positive bag. For six datasets, we set
$z_1 = 4$ and $z_2 = 2$, and for five other datasets, $z_1 = 2$, $z_2 = 7$ and
$z_3 = 5$. A negative bag can be created by either increasing or decreasing the
required number $z_i$ of instances for a particular $c_i$. We chose a new
number from $\{0, \ldots, z_i - 1\} \cup \{z_i + 1, \ldots, 10\}$ with equal
probability. If this number was less than $z_i$, we replaced instances of
concept $c_i$ by random instances; if it was greater, we replaced random
instances by instances of concept $c_i$. The minimal bag size in this dataset
is $\sum_i z_i + 10|C|$, the maximal possible bag size is $20|C| + 10$.
Accuracy results are given in Table 4. The parameters of the dataset are given
as $(z_1 \ldots z_n) - (l_r) - (l_i)$. For example, dataset 42-5-5
requires exactly 4 instances of the first concept and 2 instances of the second
concept in a positive bag, using 5 relevant and 5 irrelevant attributes.

Table 4. Results for count-based MI data

Dataset    MI SVM         TLC without AS   TLC with AS
42-5-0     52.78 ± 2      99.55 ± 0.64     99.35 ± 0.92
42-5-5     52.7 ± 1.04    57.89 ± 11.09    85.22 ± 21.65
42-5-10    53.83 ± 1.46   57.63 ± 7.64     70.9 ± 25.94
42-10-0    55.21 ± 1.76   90.89 ± 6.25     92.76 ± 1.64
42-10-5    54.62 ± 0.5    57.8 ± 8.55      92.78 ± 2.45
42-10-10   55.59 ± 2.81   51.05 ± 1.6      65.1 ± 20.35
275-5-0    54.31 ± 2.07   95.15 ± 2.4      95.15 ± 2.4
275-5-5    51.6 ± 0.45    55.2 ± 6.13      83.87 ± 17.67
275-5-10   52.34 ± 0.5    50.33 ± 0.72     56.94 ± 11.64
275-10-0   54.52 ± 1.54   87.85 ± 4.26     89.86 ± 3.4
275-10-5   54.5 ± 1.81    54.11 ± 4.79     83.5 ± 17.88

Table 5. Results for musk-datasets

Dataset   DD     MI SVM       TLC without AS
musk1     88.9   86.4 ± 1.1   88.69 ± 1.64
musk2     82.5   88.0 ± 1.0   83.13 ± 3.23
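
The count-based rule can be illustrated with a short Python sketch: a bag is positive only if every concept occurs exactly the required number of times, and a negative bag is obtained by drawing a different count uniformly from 0 to 10. The function names are hypothetical and only the counts are modelled, not the instances.

```python
import random

def is_positive_count(counts, required):
    """A bag is positive if each concept occurs exactly z_i times."""
    return all(c == z for c, z in zip(counts, required))

def negative_count(z, rng, max_count=10):
    """Draw a count uniformly from {0, ..., z-1} U {z+1, ..., max_count}."""
    candidates = [n for n in range(max_count + 1) if n != z]
    return rng.choice(candidates)

required = [4, 2]                 # the "42" datasets
rng = random.Random(0)
new_first = negative_count(required[0], rng)
assert not is_positive_count([new_first, required[1]], required)
```

Changing the count of a single concept is enough to make the bag negative, in either direction.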

The results for the count-based MI data (Table 4) show that the TLC method 
is able to learn this type of concept, even if a reasonable number of irrelevant 
attributes is involved. In datasets with a very high ratio of irrelevant attributes 
(dataset 275-5-10), even TLC fails, because the underlying concepts cannot be 
identified accurately enough. Attribute selection improves the performance of 
TLC, but only in some of the five runs, which leads to high variance. The results 
show that the MI SVM cannot learn count-based MI concepts; its performance 
is only slightly better than the default accuracy. 



5.2 Musk Datasets 

We have also evaluated the performance of TLC on the Musk datasets used by
Dietterich et al. [1]. As described above, we used a boosted decision stump with
10 boosting iterations at the second level. At the first level, we used a
standard pruned C4.5 tree [9]. However, performance improved after equal-width
discretization based on ten intervals, representing the split points as binary
attributes [10], and our results are based on the discretized data. We performed
10 runs of 10-fold cross-validation and give their average accuracy and standard
deviation in Table 5. The results for the MI SVM are an average value of 1000
runs of randomly leaving out 10 bags and training the classifier on the remaining
ones [2], using a $\gamma$-parameter of $10^{-5.5}$. The results for the DD
algorithm were achieved by twenty runs of a 10-fold cross-validation [4].
Table 5 shows that TLC
can successfully be applied to the Musk data.
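
The discretization step can be sketched in Python as follows. This is one plausible reading, stated as an assumption: ten equal-width intervals give nine split points, and each split point becomes a binary attribute indicating whether the original value exceeds it. The function names are hypothetical.

```python
def equal_width_split_points(values, n_intervals=10):
    """Split points that cut [min, max] into n_intervals equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + width * k for k in range(1, n_intervals)]

def binarize(value, split_points):
    """One binary attribute per split point: 1 if the value exceeds it."""
    return [1 if value > s else 0 for s in split_points]

values = [0.0, 2.5, 5.0, 7.5, 10.0]
splits = equal_width_split_points(values)
print(len(splits))             # 9
print(binarize(2.5, splits))   # [1, 1, 0, 0, 0, 0, 0, 0, 0]
```

This encoding keeps the ordering of the intervals, so a tree can still split on "greater than" conditions.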

6 Conclusions and Further Work 

We have presented three different generalizations of the multi-instance assump- 
tion. Two-level classification is an elegant way to tackle these learning problems, 
and as our results show, it performs well on artificial data representing all three 
types of problems. A simple form of attribute selection increases the performance 
considerably in cases where a high ratio of irrelevant attributes makes it hard 
to discover the underlying instance-level concepts. On the Musk data, our al- 
gorithm performs comparably with state-of-the-art methods for this problem. 
Further work includes the application of a different clustering technique at the 
first level to structure the instance space and the application of our method to 
an image classification task. 
Abstract. Cluster analysis is a fundamental technique in pattern recognition.
Clustering is difficult on complex data sets. This paper presents a new
algorithm for clustering. There are three key ideas in the algorithm: using
mutual neighborhood graphs to discover knowledge and cluster data; using
eigenvalues of local covariance matrices to express knowledge and form a
knowledge embedded space; and using a denoising trick in the knowledge embedded
space to implement clustering. Essentially, it learns a new distance metric by
knowledge embedding, and clustering becomes easier under this distance metric.
The experimental results show that the algorithm can construct a quality
neighborhood graph from a complex and noisy data set and solve clustering
problems well.



1 Introduction 

Cluster analysis is the automatic identification of groups of similar objects. The
discovered clusters serve as the foundation for other data mining and analysis
techniques. There has been much work on cluster analysis. Existing clustering
algorithms, such as K-means [1], PAM [2], CLARANS [3], DBSCAN [4], CURE
[5], and ROCK [6], are designed to find clusters that fit some static model.
These algorithms break down if the model is not adequate to capture the
characteristics of the clusters. Most of them break down when the data
set consists of clusters of different shapes, densities and sizes [7].

This paper presents a new clustering algorithm. There are three key ideas
in the algorithm. The first is using mutual neighborhood graphs to discover
knowledge and cluster data. The second is using eigenvalues of local covariance
matrices to express knowledge and embedding this knowledge into the input space
to form a knowledge embedded space. MNN (Mutual Nearest Neighbor) distance
in the knowledge embedded space is used as the new distance metric instead of
Euclidean distance in the input space. The third is using a denoising trick in
the knowledge embedded space to implement clustering. Essentially, the algorithm
learns a new distance metric by knowledge embedding, and clustering becomes
easier under this distance metric. The experimental results show that the
algorithm can cluster data of different shapes, densities and sizes correctly.

The rest of this paper is organized as follows: Section two gives basic notions 
and an overview of related work. Section three describes the new method in 
detail. Section four explains several experiments using the new method. Section 
five gives some discussions. Section six presents conclusions. 
2 Basic Notions and Related Work

The basic ideas of our algorithm are inspired by LLE [8,9]. LLE shows an
efficient way to use local information to solve nonlinear problems. First, it
constructs a neighborhood graph. Next, it derives reconstruction weights from
the neighborhood graph. Finally, it carries out nonlinear dimensionality
reduction by using the reconstruction weights. Our algorithm is similar to LLE
except that it solves clustering problems. First, a mutual neighborhood graph
is constructed. For ideal data sets, all points belonging to the same cluster
are connected in the neighborhood graph, and points belonging to different
clusters are not connected. In practice, however, due to noise and the
complexity of the data set, different clusters are often connected and must be
split from each other. Then, local information useful for clustering is
discovered and embedded into the input space, forming a knowledge embedded
space. Finally, clustering is done using this information, and different
clusters are split from each other.

The local information used in our algorithm is the eigenvalues of local
covariance matrices. This is an extension of local principal component analysis
methods [10,11,12]. Our method directly uses all the eigenvalues of the local
covariance matrices to represent local knowledge rather than only analyzing
local principal components. The advantage is that the eigenvalues contain all
the information about local shape and local size rather than only local
dimension. In sections 3.4 and 3.5, we will see that eigenvalues are useful for
the denoising of input data and of $\Lambda$ knowledge, which are pivotal steps
for clustering.

The kernel-based method [13,14,15] is a typical solution for nonlinear
problems. Its key idea is to transform nonlinear data sets in the input space
into linear data sets in a high dimensional feature space. Essentially, it
finds a new distance metric by space transformation. The primary difficulty
of kernel-based methods is choosing a proper kernel function for this task.
In our method, we also construct a high dimensional space, an easily formed
knowledge embedded space. The purpose is to analyze information useful for
clustering, not to transform nonlinear data sets into linear data sets.

3 NK Algorithm 

Our algorithm is called the NK algorithm, which stands for mutual neighborhood
graph construction by knowledge embedding: "N" indicates neighborhood and "K"
indicates knowledge. This section describes the NK algorithm in detail. First,
some basic concepts are defined, and then the details of the algorithm are given.

3.1 Definitions 

Here are the basic concepts used in the NK algorithm, which will be explained
in the next six subsections. The input of the algorithm is a data set $X$ with
$N$ points:

$$X = \{x_1, \ldots, x_N\}, \quad x_i \in \mathbb{R}^d, \qquad (1)$$

where $N$ is the number of points and $d$ is the dimension of the input data.

Definition 1: The set $\omega_i = \{x_{i_1}, \ldots, x_{i_K}\}$
$(i = 1, \ldots, N)$ is called the local neighborhood of $x_i$, where $K$ is the
number of neighbors and $x_{i_l}$ $(l = 1, \ldots, K)$ denotes the $l$-th
nearest neighbor of $x_i$. For convenience, $x_{i_0}$ is used to denote $x_i$
and $\omega_i$ is rewritten as $\omega_i = \{x_{i_0}, \ldots, x_{i_K}\}$.

Definition 2: The local covariance matrix of $\omega_i$ is:

$$S_i = \frac{1}{K+1} \sum_{l=0}^{K} (x_{i_l} - m_i)(x_{i_l} - m_i)^T, \qquad (2)$$

where $m_i = \frac{1}{K+1} \sum_{l=0}^{K} x_{i_l}$ is the average of $\omega_i$.

Definition 3: $\Lambda_i = [\lambda_{i1}, \ldots, \lambda_{id}]^T$
$(i = 1, \ldots, N;\ \lambda_{i1} \geq \cdots \geq \lambda_{id})$ is the
vector of eigenvalues of $S_i$ and is called the local feature of $x_i$. The
knowledge represented by the local feature is called $\Lambda$ knowledge.

Definition 4: A neighborhood graph is an undirected weighted graph
$G = (X, E)$, where $X$ is the set of data points and $E$ is the set of edges
between neighbors, with weights $e_{ij}$ representing the distance. When MNV
(Mutual Neighborhood Values) [16] are used as the weights, the neighborhood
graph is called a mutual neighborhood graph and the distance represented by MNV
is called MNN distance. If $x_i$ and $x_j$ are not neighbors, let $e_{ij} = 0$,
indicating there is no edge between them; otherwise, $e_{ij}$ is the MNV of
$x_i$ and $x_j$:

$$e_{ij} = \begin{cases} L_{ij}, & x_j \in \omega_i \\ 0, & x_j \notin \omega_i \end{cases} \qquad (3)$$

where $L_{ij}$ is the mutual neighborhood value of the pair of points $x_i$ and
$x_j$. If $x_j$ is the $p$-th nearest neighbor of $x_i$ and $x_i$ is the $q$-th
nearest neighbor of $x_j$, then $L_{ij} = p + q - 2$, $p, q = 1, \ldots, K$ [16].

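Definition 5: The inadaptability of $\omega_i$ is defined as follows:

$$a_i = \frac{1}{d} \sum_{j=1}^{d} \frac{\lambda_{ij}}{\bar{\lambda}_{ij}}, \qquad (4)$$

where $\lambda_{ij}$ is the $j$-th element of $\Lambda_i$,
$\bar{\lambda}_{ij} = \frac{1}{K} \sum_{t} \lambda_{tj}$, and $\Lambda_t$
$(t = i_1, \ldots, i_K)$ is the vector of eigenvalues of $S_t$. $i_l$
$(l = 1, \ldots, K)$ is the subscript of $x_{i_l}$, which is the $l$-th
nearest neighbor of $x_i$.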
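
To make the local feature concrete, here is a minimal Python sketch that computes the eigenvalues of the covariance matrix of one neighborhood. It is restricted to two-dimensional data so that the closed form for a symmetric 2x2 matrix can be used; normalizing by the number of points in the neighborhood is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def local_feature_2d(points):
    """Eigenvalues (largest first) of the covariance of 2-D points.

    `points` holds the point itself plus its K neighbors.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return [half_trace + root, half_trace - root]

# Three collinear points: all variance lies along one direction.
print(local_feature_2d([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))
```

For collinear points the second eigenvalue is zero, which is how the eigenvalues encode local shape as well as local size.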



3.2 Mutual Neighborhood Graph Construction

The first step is mutual neighborhood graph construction: calculate the
Euclidean distance of each pair $x_i$ and $x_j$; calculate the mutual
neighborhood values $L = \{L_{ij}\}$; find the $K$ nearest neighbors for each
point $x_i$ according to the mutual distance $L_{ij}$.

This is a K-NN method, which suits our algorithm: when $K$ is fixed,
the eigenvalues of a local covariance matrix represent the distribution of the
local data, which helps us to learn knowledge from the data. Furthermore, mutual
nearest neighbor distance is used instead of Euclidean distance. MNN distance
contains important knowledge for nonlinear problems and makes
it easier for points with similar densities to cluster together. We regard local
density as important knowledge for clustering, so MNN distance is more suitable
here than traditional Euclidean distance.
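
The mutual neighborhood value can be sketched in Python as a rank computation: if one point is the p-th nearest neighbor of another and the second is the q-th nearest neighbor of the first, their value is p + q - 2. This sketch ranks all other points rather than only the first K, uses a brute-force distance computation, and its function names are hypothetical.

```python
import math

def mutual_neighborhood_values(points):
    """Return L with L[i][j] = p + q - 2, where point j is the p-th nearest
    neighbor of point i and point i is the q-th nearest neighbor of point j."""
    n = len(points)
    # rank[i][j] = position of j in i's sorted neighbor list (1 = nearest).
    rank = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for p, j in enumerate(order, start=1):
            rank[i][j] = p
    return [[rank[i][j] + rank[j][i] - 2 if i != j else 0
             for j in range(n)] for i in range(n)]

def k_nearest_by_mnv(L, i, k):
    """Indices of the k points with the smallest mutual value to point i."""
    others = [j for j in range(len(L)) if j != i]
    return sorted(others, key=lambda j: L[i][j])[:k]

points = [(0.0,), (1.0,), (3.0,), (7.0,)]
L = mutual_neighborhood_values(points)
print(L[0][1], L[1][2], L[0][3])   # 0 1 4
print(k_nearest_by_mnv(L, 0, 2))   # [1, 2]
```

Two points that are each other's nearest neighbor get the value 0, and the matrix is symmetric by construction.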



3.3 Distance Metric Learning by Knowledge Embedding 



The second step is to discover knowledge from the mutual neighborhood graph.
After mutual neighborhood graph construction, the neighborhood of each data
point is identified. As a result, the local covariance matrix $S_i$ and its
eigenvalues $\Lambda_i$ can be computed. This gives the local knowledge for each
point, and the knowledge embedded space is formed by combining $x_i$ with
$\Lambda_i$:

$$y_i = \begin{bmatrix} x_i \\ \Lambda_i \end{bmatrix},$$

where $y_i$ is the point corresponding to $x_i$ in the knowledge embedded space.
In the next subsections two steps of denoising are performed, which are the
pivotal steps in the NK algorithm.

The distance metric used in our algorithm is MNN distance in the knowledge
embedded space rather than Euclidean distance in the input space. If two points
are close under this distance metric, it means that in the input space their
coordinates are close and their local features are similar. How this metric is
used for clustering will be discussed in section 5.



3.4 Denoising of Input Data 

When a data set has some background noise, clustering becomes difficult, so the
background noise should be removed from the data set. With $\Lambda$ knowledge,
this task is easy. Background noise is usually very sparse, so the eigenvalues
of the local covariance matrices of noise points are much larger than those of
other points. Background noise can therefore be removed by using a threshold
$E(\lambda_i) + P_{noise} \cdot D(\lambda_i)$, where $E(\lambda_i)$ and
$D(\lambda_i)$ are the mean and variance of $\lambda_i$ respectively,
and $P_{noise}$ is a parameter. How to choose $P_{noise}$ will be discussed in section 5.
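
The thresholding rule can be sketched in a few lines of Python. The sketch assumes one scalar eigenvalue per point (for instance the largest eigenvalue of its local covariance matrix) and uses the population variance; both choices, and the function name, are assumptions made for illustration.

```python
from statistics import fmean, pvariance

def remove_background_noise(points, eigenvalues, p_noise):
    """Keep points whose eigenvalue is at most mean + p_noise * variance."""
    threshold = fmean(eigenvalues) + p_noise * pvariance(eigenvalues)
    kept = [p for p, lam in zip(points, eigenvalues) if lam <= threshold]
    return kept, threshold

points = ["a", "b", "c", "d", "noise"]
eigenvalues = [1.0, 1.0, 1.0, 1.0, 6.0]   # mean 2.0, variance 4.0
kept, threshold = remove_background_noise(points, eigenvalues, p_noise=0.5)
print(threshold, kept)   # 4.0 ['a', 'b', 'c', 'd']
```

A smaller parameter lowers the threshold and removes more points, which matches the advice to use small values when background noise is present.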

3.5 Denoising of $\Lambda$ Knowledge

In neighborhood graphs, there are often some edges connecting two points that
are not actually neighbors. These are called false edges. In Fig. 1, the edges
between layers are false edges. Where there is a false edge, the corresponding
eigenvalues of the local covariance matrix cannot represent the correct local
feature of the neighborhood. We consider this a kind of noise in the $\Lambda$
knowledge. An efficient algorithm must be used to remove the false edges, and
this is called the denoising of $\Lambda$ knowledge.

Inadaptability $a_i$ is defined for finding false edges. In Fig. 2, $x_i$ is a
point with false edges and $\omega_i$ is the corresponding neighborhood. $x_j$
is a neighbor of $x_i$ that has no false edges and $\omega_j$ is the
neighborhood of $x_j$. $\Lambda_i$ and $\Lambda_j$ are the corresponding
eigenvalues. The elements of $\Lambda_i$ tend to be larger than those of
$\Lambda_j$, so neighborhoods with false edges can be found by comparing their
eigenvalues. There may be clusters with different densities, and eigenvalues in
a sparse cluster will be larger than those in a dense cluster. We therefore
compare eigenvalues only between neighbors and define inadaptability as in
Definition 5, where $\lambda_{ij}$ is the $j$-th element of $\Lambda_i$ and
$\bar{\lambda}_{ij}$ is the mean of the $j$-th eigenvalue of $\Lambda_{i_l}$
$(l = 1, \ldots, K)$. The similarity between $\omega_i$ and $\omega_{i_l}$ can
thus be measured by $a_i$. If $\omega_i$ and $\omega_{i_l}$ are similar, $a_i$
is small; otherwise $a_i$ is large. Each $\omega_i$ whose corresponding $a_i$ is
larger than a threshold is selected as having false edges.
$E(a_i) + P_{false} \cdot D(a_i)$ is used as the threshold, where $E(a_i)$ and
$D(a_i)$ are the mean and variance of $a_i$ $(i = 1, \ldots, N)$ and
$P_{false}$ is a parameter. How to choose $P_{false}$ will be discussed in
section 5.

Each $\omega_i$ with false edges is denoised by a steepest descent method.
First, calculate the mean of the neighborhood,
$m_i = \frac{1}{K+1} \sum_{l=0}^{K} x_{i_l}$. Then, calculate the distance
between each $x_{i_l}$ and $m_i$. Next, remove the point with the maximal
distance from $\omega_i$, and repeat this procedure until the inadaptability on
the remaining points is smaller than the threshold
$E(a_i) + P_{false} \cdot D(a_i)$.
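
The removal loop can be sketched in Python as follows. The inadaptability measure is passed in as a caller-supplied function, a hypothetical hook, because this sketch only illustrates the loop of dropping the neighbor farthest from the neighborhood mean; the toy measure in the example is a stand-in, not the measure of Definition 5. Removing the point from the dropped neighbor's own neighborhood is also omitted.

```python
import math

def prune_false_edges(x, neighbors, inadaptability, threshold):
    """Drop the neighbor farthest from the neighborhood mean until
    inadaptability(x, neighbors) falls below the threshold."""
    neighbors = list(neighbors)
    while len(neighbors) > 1 and inadaptability(x, neighbors) >= threshold:
        members = [x] + neighbors
        # Mean of the neighborhood, including the point itself.
        mean = tuple(sum(p[k] for p in members) / len(members)
                     for k in range(len(x)))
        farthest = max(neighbors, key=lambda p: math.dist(p, mean))
        neighbors.remove(farthest)
    return neighbors

# Toy stand-in measure: distance from x to its farthest neighbor.
def toy_measure(x, nbrs):
    return max(math.dist(x, p) for p in nbrs)

x = (0.0, 0.0)
nbrs = [(1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
print(prune_false_edges(x, nbrs, toy_measure, threshold=5.0))
# [(1.0, 0.0), (0.0, 1.0)]
```

Each pass removes exactly one neighbor, so the loop ends after at most K - 1 removals.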

After denoising, a well constructed mutual neighborhood graph is obtained. 
It is called a denoised mutual neighborhood graph. 

3.6 Clustering in Knowledge Embedded Space 

In the last step, we cluster data in the knowledge embedded space. Start from
an arbitrary node $x_i$ and find its $K$ nearest neighbors
$x_{i_1}, \ldots, x_{i_K}$. Then find the $K$ nearest neighbors of each
$x_{i_l}$, and so on. Combining all these points results in a cluster. All the
clusters can be obtained in the same way. The number of clusters is thus
determined automatically.
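
Expanding neighbors repeatedly from a start node amounts to finding the connected components of the denoised graph. A minimal Python sketch using breadth-first search is given below; the dictionary representation of the graph and the function name are assumptions made for illustration.

```python
from collections import deque

def cluster_by_neighbors(neighbors):
    """neighbors: dict mapping each node to its neighbor nodes in the
    denoised graph. Returns the clusters as a list of sets."""
    # Treat every edge as undirected.
    adj = {v: set() for v in neighbors}
    for v, nbrs in neighbors.items():
        for w in nbrs:
            adj[v].add(w)
            adj.setdefault(w, set()).add(v)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), set()
        while queue:
            v = queue.popleft()
            cluster.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        clusters.append(cluster)
    return clusters

graph = {0: [1], 1: [2], 2: [], 3: [4], 4: [3]}
print([sorted(c) for c in cluster_by_neighbors(graph)])  # [[0, 1, 2], [3, 4]]
```

The number of clusters is simply the number of components found, so it does not have to be specified in advance.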

3.7 Summarization of the Algorithm 

The algorithm is presented in detail in Table 1. 

4 Experiments 

4.1 Clustering 

Figs. 1-4 show experimental results on a two-layer Swiss roll, which is a
typical nonlinear problem. After the first step there are many false edges
between the layers, so all the points would be clustered into one cluster (see
Fig. 1). Our algorithm clusters all the points correctly because it uses a
denoising trick to remove the false edges (see Fig. 4).

Figs. 5-8 show experiments on data sets containing many clusters with different
shapes, densities and sizes, and also some background noise. The data sets come
from [17]. All the points are correctly clustered and the background noise is
removed. To keep the figures legible, only part of the points are shown and
background noise is not shown.

Table 1. NK algorithm

Step 1: Mutual neighborhood graph construction.
- Calculate the Euclidean distance matrix $D = \{d_{ij}\}$.
- Calculate the mutual neighborhood value matrix $L = \{L_{ij}\}$.
- Find the $K$ nearest neighbors for each point $x_i$ under MNN distance.

Step 2: Knowledge embedding.
- Calculate the local covariance matrix $S_i$ of each point $x_i$.
- Calculate the eigenvalues $\Lambda_i$ of $S_i$.

Step 3: Denoising of input data.
- Calculate the threshold $E(\lambda_i) + P_{noise} \cdot D(\lambda_i)$ and
  remove noise points.
- Construct a new mutual neighborhood graph on the denoised input data.

Step 4: Denoising of $\Lambda$ knowledge.
- Recalculate the eigenvalues $\Lambda_i$ of each point $x_i$.
- Calculate the inadaptability $a_i$ of each point $x_i$.
- Calculate the threshold $E(a_i) + P_{false} \cdot D(a_i)$ and select each
  $\omega_i$ with false edges, i.e. whose corresponding $a_i$ is larger than
  the threshold.
- Denoise each $\omega_i$ with false edges by a steepest descent method:
  a) Calculate the mean of the neighborhood,
     $m_i = \frac{1}{K+1} \sum_{l=0}^{K} x_{i_l}$.
  b) Calculate the distance $d_l$ between $x_{i_l}$ and $m_i$, where $x_{i_l}$
     is the $l$-th neighbor of $x_i$.
  c) Find the $x_{i_t}$ whose corresponding $d_t$ is the maximum of all $d_l$,
     $l = 1, \ldots, K$. Remove the point $x_{i_t}$ from $\omega_i$ and remove
     the point $x_i$ from $\omega_{i_t}$. This breaks the edge between $x_i$
     and $x_{i_t}$.
  d) Repeat c) until the inadaptability on the remaining points is smaller than
     the threshold.

Step 5: Clustering in knowledge embedded space.



4.2 Enhanced Isomap 

The algorithm was also used to construct the neighborhood graph for ISOMAP 
[18]. For complex data sets, there are often some false edges in the neighborhood 
graph which prevent ISOMAP from reducing dimensionality correctly. After re- 
moving these false edges, ISOMAP was able to find the structure of complex 
data sets much more accurately (see Fig. 9). 
Fig. 1. Mutual neighborhood graph of a two-layer Swiss roll of 2000 points

Fig. 2. The rectangle area in Fig. 1

Fig. 3. Inadaptability of all points. The horizontal line is the threshold

Fig. 4. Mutual neighborhood graph after denoising of $\Lambda$ knowledge.
K = 10, $P_{noise}$ = 8, $P_{false}$ = 2

Fig. 12. Eps = 0.5

Fig. 13. Eps = 0.4



4.3 Comparison with DBSCAN

DBSCAN [4] is a well-known spatial clustering algorithm that has been shown
to find clusters of arbitrary shapes. We have done some experiments to compare
the NK algorithm and DBSCAN.

In most of our experiments, DBSCAN works very well, but it fails in some cases
where the NK algorithm works well. Figs. 10-13 show some results of
DBSCAN. Following the recommendation of [4], MinPts was fixed to 4
and Eps was varied in these experiments. Fig. 10 is the best result of DBSCAN,
with Eps = 0.772; there are 11 clusters. If Eps is increased to 0.773, parts
of different layers of the Swiss roll are clustered into the same cluster
(see Fig. 11). In Fig. 12 and Fig. 13 the data set contains clusters of different
densities, and the figures illustrate that DBSCAN cannot effectively find clusters
of different densities [7,19], while the NK algorithm works well on the same data set.
To keep the figures legible, only part of the points are shown, and background
noise is not shown.
4.4 Run-Time Analysis

The overall computational complexity of the NK algorithm mainly depends on the
time required to compute nearest neighbors and to compute the eigenvalues
of the local covariance matrices. The complexity of computing nearest neighbors
is $O(dN^2)$ [9]. However, other nearest neighbor algorithms such
as K-D trees can be used to compute the neighbors in $O(N \log N)$ time [20].
Computing the eigenvalues of one local covariance matrix scales as $O(d^3)$ [9].
As a result, the overall complexity of the NK algorithm is $O(dN^2 + d^3 N)$.
It will be greatly sped up if a faster neighbor computation algorithm is used.

We have implemented NK algorithm in MATLAB, running on a Pentium 4 
2.0GHz processor. Table 2 gives the actual running times of the algorithm on 
different data sets of the same dimension of 2. 

Table 2. Running time in seconds

Data Set   Size   Graph construction   Denoising   Overall
1          2000   2.4                  0.8         3.2
2          4000   11.1                 1.6         12.7
3          6000   27.5                 2.5         30
4          8000   41                   4           45
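
As a rough consistency check, the graph construction times in Table 2 can be compared with quadratic growth in the number of points. The Python sketch below estimates the growth exponent between the smallest data set and each larger one; the function name is hypothetical, and four timings are far too few for more than an indication.

```python
import math

sizes = [2000, 4000, 6000, 8000]
graph_seconds = [2.4, 11.1, 27.5, 41.0]   # graph construction column of Table 2

def growth_exponent(n1, t1, n2, t2):
    """Exponent e such that t2 / t1 = (n2 / n1) ** e."""
    return math.log(t2 / t1) / math.log(n2 / n1)

for n, t in zip(sizes[1:], graph_seconds[1:]):
    e = growth_exponent(sizes[0], graph_seconds[0], n, t)
    print(n, round(e, 2))
```

The estimated exponents all lie slightly above 2, in line with a graph construction step that is dominated by the quadratic all-pairs distance computation.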



5 Discussion 

Distance metrics are very important in many learning and data mining
algorithms. MNN distance in the knowledge embedded space is used as the new
distance metric in the NK algorithm. Here we explain how this distance metric is
used for clustering. If two points $y_i$ and $y_j$ are close under this
distance metric, that is, $x_i$ is close to $x_j$ and $\Lambda_i$ is close to
$\Lambda_j$, it means that in the input space the coordinates of these two
points are close and their local features are similar. In mutual neighborhood
graph construction, each point is connected with its neighbors. This ensures
that $x_i$ is close to $x_j$ if they are neighbors. However, the corresponding
$\Lambda_i$ and $\Lambda_j$ are not always close, since there are some false
edges. Denoising is therefore performed to remove the false edges, and after
denoising $\Lambda_i$ and $\Lambda_j$ become close. Because of this, in a
denoised neighborhood graph each pair of neighbors is close in the knowledge
embedded space. Note that the new distance metric is not used to measure the
distance between every pair of points in the same cluster, but only the
distance between neighbors.

There are three main parameters that are determined experimentally. The
most important parameter is the number of neighbors $K$. The other two are
$P_{noise}$ and $P_{false}$. In practice, first decide $P_{noise}$ and
$P_{false}$, and then choose $K$. They can be chosen almost independently.

For a data set without noise, $P_{noise}$ should be a large number and is easy
to choose, such as 6 to 8 or even larger. If there is background noise,
$P_{noise}$ should be a small number, such as 0.1 to 1.5. We can set
$P_{noise}$ equal to 1 and then reduce it if not all the noise is removed, or
increase it if too many data points are removed. When choosing $P_{noise}$, the
value of $K$ is not important as long as it is not too small or too large.
$P_{false}$ is relatively easy to choose. We can set $P_{false} = 2$ in most
cases. If some points with false edges are not detected, then $P_{false}$
should be smaller.

In neighborhood graph construction algorithms, $K$ is difficult
to choose, especially when the data sets are complex. In our algorithm, $K$ is
easier to choose than in many other algorithms, thanks to the denoising trick.
When false edges cannot be removed even though the points with false edges are
detected, $K$ should be smaller. When a data set is sparse or asymmetrical, a
cluster will sometimes break where the data is very sparse. In this case,
increasing $K$ has some effect, but it is not a thorough solution. This is still
a problem to be solved. Most other clustering algorithms also fail in this
setting. In our experiments, the NK algorithm works much better than DBSCAN when
the data set is sparse or asymmetrical.

6 Conclusion 

This paper presents a new algorithm for clustering by knowledge embedding.
Mutual neighborhood graphs are used for knowledge discovery and clustering.
Eigenvalues of local covariance matrices are used to represent local features.
Denoising is needed because data sets are usually complex. The experimental
results show that the algorithm can construct a quality neighborhood graph
from a complex and noisy data set and that it solves clustering problems well.
Abstract. In multi-instance learning, the training set comprises labeled
bags that are composed of unlabeled instances, and the task is to predict 
the labels of unseen bags. Through analyzing two famous multi-instance 
learning algorithms, this paper shows that many supervised learning al- 
gorithms can be adapted to multi-instance learning, as long as their 
focuses are shifted from the discrimination on the instances to the dis- 
crimination on the bags. Moreover, considering that ensemble learning 
paradigms can effectively enhance supervised learners, this paper pro- 
poses to build ensembles of multi-instance learners to solve multi-instance 
problems. Experiments on a real-world benchmark test show that ensem- 
ble learning paradigms can significantly enhance multi-instance learners, 
and the result achieved by EM-DD ensemble exceeds the best result on 
the benchmark test reported in literature. 



1 Introduction 

The term multi-instance learning was coined by Dietterich et al. [11] when they
were investigating the problem of drug activity prediction. In this learning
framework, the training set is composed of many bags, each containing many
instances. A bag is positively labeled if it contains at least one positive
instance. Otherwise it is negatively labeled. The task is to learn some concept
from the training bags for correctly labeling unseen bags. This task is very
difficult because, unlike supervised learning where all the training instances
are labeled, here the labels of the individual instances are unknown. It has
been shown that learning algorithms that ignore the characteristics of
multi-instance learning do not work
well in this scenario [11].
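
The labeling rule can be written in one line of Python. The sketch below is only an illustration of the rule itself: in the actual learning problem the instance labels are hidden from the learner, which sees only the bag labels. The function name is hypothetical.

```python
def bag_label(instance_labels):
    """A bag is positive if at least one of its instances is positive."""
    return any(instance_labels)

print(bag_label([False, False, True]))   # True
print(bag_label([False, False, False]))  # False
```

A positive bag label therefore tells the learner that some instance is positive, but not which one.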

The PAC-learnability of multi-instance learning has been studied by many
researchers [2][3][5][13], and some important results have been obtained, such
as 'if the instances in the bags are not independent then APR (Axis-Parallel
Rectangle) learning [11] under the multi-instance learning framework is
NP-hard' [3]. At present, the most famous multi-instance learning algorithm is
Diverse Density [14], which has been applied to several applications including
stock prediction [14], natural scene classification [15], and content-based
image retrieval [20]. There are also many other practical algorithms, such as
Citation-kNN [18], Relic [17], ID3-MI [8], RIPPER-MI [8], EM-DD [21], BP-MIP
[23], etc. Recently, multi-instance regression with real-valued outputs has
begun to be studied [1][16]. It is worth noting that multi-instance learning
has also attracted the attention of the ILP community. It has been suggested
that multi-instance problems could be regarded as a bias on inductive logic
programming, and that the multi-instance paradigm could be the key between the
propositional and relational representations, being more expressive than the
former and much easier to learn than the latter [9].

In this paper, two famous multi-instance learning algorithms, i.e. Diverse 
Density and Citation-kNN, are analyzed, which suggests that many supervised
learning algorithms can be adapted to multi-instance learning as long as they 
attempt to discriminate the bags instead of the instances. Then, considering that 
ensemble learning paradigms that train multiple learners to solve a problem can 
effectively improve the generalization ability in supervised learning [10], this 
paper proposes to build multi-instance ensembles to solve multi-instance prob- 
lems. Experiments on a real-world benchmark data set show that current multi- 
instance learners can be significantly enhanced by ensemble learning paradigms. 
Moreover, it is observed that the ensemble of a specific multi-instance learner,
i.e. EM-DD, exhibits the best performance to date on the benchmark test.

The rest of this paper is organized as follows. Section 2 analyzes the Diverse 
Density algorithm and the Citation-kNN algorithm. Section 3 proposes to build
multi-instance ensembles. Section 4 presents the experimental results. Finally, 
Section 5 summarizes the contributions of this paper. 

2 Adapt Supervised Algorithms 
to Multi-instance Learning 

When proposing the notion of multi-instance learning, Dietterich et al. [11] raised 
an open problem, i.e. designing multiple instance modifications for popular ma- 
chine learning algorithms. In fact, multi-instance versions of many machine learn- 
ing algorithms have been developed in recent years [8] [17] [18] [23]. However, there 
is no general rule indicating how to do such a modification. 

Usually, the focus of a supervised learning algorithm is to discriminate the
instances, which is feasible since all training instances are labeled in the supervised
scenario. But in multi-instance learning, it is infeasible to build a model by
discriminating training instances because none of them is labeled. Moreover, if
the label of a bag is simply regarded as the label of its instances, i.e. if we assume
that a positive bag contains only positive instances and a negative bag contains
only negative instances, then the learning task may be very difficult even though
every training instance now holds a label. This is because the positive noise
may be extremely high¹, as indicated by [11]. Therefore, whether it is possible
to discriminate the training instances or not is the principal difference between
supervised and multi-instance learning.

In this section we claim that many supervised learning algorithms can be
adapted to multi-instance learning, as long as they shift their focus from the
discrimination on the instances to the discrimination on the bags. We illustrate
that two well-known multi-instance learning algorithms², i.e. Diverse Density
and Citation-kNN, can be derived from the standard Bayesian classifier and the
k-nearest neighbor algorithm according to our claim. These two algorithms are chosen
for analysis because Diverse Density is the most famous multi-instance learning
algorithm at present, and Citation-kNN had achieved the best result on the real-world
multi-instance benchmark test [18] before EM-DD, a variant of Diverse
Density, was proposed.

¹ Consider that a positive bag may contain hundreds or even thousands of negative
instances but only one positive instance.



2.1 Diverse Density 



The Diverse Density algorithm [14] regards each bag as a manifold, which is 
composed of many instances, i.e. feature vectors. If a new bag is positive then 
it is believed to intersect all positive feature-manifolds without intersecting any 
negative feature-manifolds. Intuitively, diverse density at a point in the feature 
space is defined to be a measure of how many different positive bags have in- 
stances near that point, and how far the negative instances are from that point. 
Thus, the task of multi-instance learning is transformed into a search for a point in
the feature space with the maximum diverse density.

It is evident that the key of the Diverse Density algorithm lies in the formal 
definition of the maximum diverse density, which is the objective to be optimized 
by the algorithm. Below we show that such a definition can be achieved through 
modifying standard Bayesian classifier according to the rule, i.e. shifting the 
focus from discriminating the instances to discriminating the bags. 

Given data set $D$ and a set of class labels, i.e. $C = \{c_1, c_2, \ldots, c_t\}$, to be
predicted, the posterior probability of the class can be estimated according to
the Bayes rule as shown in Eq. 1.

$$\Pr(c_k \mid D) = \frac{\Pr(D \mid c_k)\,\Pr(c_k)}{\Pr(D)} \tag{1}$$



What we want is the class label with the maximum posterior probability, as 
indicated in Eq. 2, where Obj denotes the objective. 



$$\mathrm{Obj} = \arg\max_{1 \le k \le t} \Pr(c_k \mid D) = \arg\max_{1 \le k \le t} \frac{\Pr(D \mid c_k)\,\Pr(c_k)}{\Pr(D)} \tag{2}$$



Considering that $\Pr(D)$ is a constant which can be dropped, and that $\Pr(c_k)$ can
also be dropped if we assume a uniform prior, Eq. 2 can be simplified as
Eq. 3.

$$\mathrm{Obj} = \arg\max_{1 \le k \le t} \Pr(D \mid c_k) \tag{3}$$

² Due to the limited paper length, the analyses of more multi-instance learning algorithms
are left to be presented in a longer version of this paper.
Eq. 3 is enough when the goal is to discriminate the instances. But for discriminating
the bags, it is helpful to consider $D = \{B_1^+, \ldots, B_m^+, B_1^-, \ldots, B_n^-\}$,
where $B_i^+$ denotes the $i$-th positive bag while $B_j^-$ denotes the $j$-th negative bag.
Then, Eq. 3 can be re-written as Eq. 4 assuming that the bags are conditionally
independent.



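$$\mathrm{Obj} = \arg\max_{1 \le k \le t} \Pr\left(\{B_1^+, \ldots, B_m^+, B_1^-, \ldots, B_n^-\} \mid c_k\right) = \arg\max_{1 \le k \le t} \prod_{1 \le i \le m} \Pr(B_i^+ \mid c_k) \prod_{1 \le j \le n} \Pr(B_j^- \mid c_k) \tag{4}$$

Applying the Bayes rule to each factor of Eq. 4, we get Eq. 5.

$$\mathrm{Obj} = \arg\max_{1 \le k \le t} \prod_{1 \le i \le m} \frac{\Pr(c_k \mid B_i^+)\,\Pr(B_i^+)}{\Pr(c_k)} \prod_{1 \le j \le n} \frac{\Pr(c_k \mid B_j^-)\,\Pr(B_j^-)}{\Pr(c_k)} \tag{5}$$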



Considering that $\prod_{1 \le i \le m} \Pr(B_i^+) \prod_{1 \le j \le n} \Pr(B_j^-)$ is a constant which can be
dropped, and recalling that $\Pr(c_k)$ can be dropped as was done in
Eq. 3 because we assume a uniform prior, Eq. 5 can be simplified as Eq. 6.



$$\mathrm{Obj} = \arg\max_{1 \le k \le t} \prod_{1 \le i \le m} \Pr(c_k \mid B_i^+) \prod_{1 \le j \le n} \Pr(c_k \mid B_j^-) \tag{6}$$



Eq. 6 is the general expression for the class label with the maximum posterior
probability. Concretely, the class label for a specific point $x$ in the feature space
can be expressed as Eq. 7, where $(x = c_k)$ means the label of $x$ is $c_k$.

$$\mathrm{Obj}^x = \arg\max_{1 \le k \le t} \prod_{1 \le i \le m} \Pr(x = c_k \mid B_i^+) \prod_{1 \le j \le n} \Pr(x = c_k \mid B_j^-) \tag{7}$$

If we want to find out a single point in the feature space where the maximum
posterior probability of a specific class label, say $c_h$, is the biggest, then the point
can be located according to Eq. 8.

$$\hat{x} = \arg\max_{x} \Pr(\mathrm{Obj}^x = c_h) = \arg\max_{x} \prod_{1 \le i \le m} \Pr(x = c_h \mid B_i^+) \prod_{1 \le j \le n} \Pr(x = c_h \mid B_j^-) \tag{8}$$



It is interesting that Eq. 8 is neither more nor less than the formal definition 
of the maximum diverse density which is optimized by the Diverse Density 
algorithm [14]! 
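To make the objective in Eq. 8 concrete, the following Python sketch evaluates it on toy data. The derivation above leaves the per-bag terms unspecified, so the sketch assumes the noisy-or bag model with a Gaussian-like instance likelihood commonly associated with Diverse Density [14]; the function names, the unscaled distance, and the crude candidate search are all illustrative assumptions rather than the original implementation.

```python
import numpy as np


def instance_prob(x, instance):
    """Assumed likelihood that `instance` matches the concept point x."""
    return np.exp(-np.sum((np.asarray(instance, dtype=float) - x) ** 2))


def diverse_density(x, positive_bags, negative_bags):
    """Value of the Eq. 8 objective at x under the assumed noisy-or model."""
    x = np.asarray(x, dtype=float)
    dd = 1.0
    for bag in positive_bags:
        # Pr(x = c_h | B_i^+): at least one instance of the bag is close to x.
        dd *= 1.0 - np.prod([1.0 - instance_prob(x, inst) for inst in bag])
    for bag in negative_bags:
        # Pr(x = c_h | B_j^-): no instance of the bag is close to x.
        dd *= np.prod([1.0 - instance_prob(x, inst) for inst in bag])
    return dd


def best_point(positive_bags, negative_bags):
    """Crude maximiser: try every instance of the positive bags as candidate x."""
    candidates = [np.asarray(inst, dtype=float)
                  for bag in positive_bags for inst in bag]
    return max(candidates,
               key=lambda x: diverse_density(x, positive_bags, negative_bags))


# Toy 2-D data: both positive bags have an instance near the origin,
# and the negative bag sits at (5, 5).
positive_bags = [[[0.0, 0.0], [5.0, 5.0]], [[0.1, 0.0], [-4.0, 3.0]]]
negative_bags = [[[5.0, 5.0]]]
print(best_point(positive_bags, negative_bags))
```

Restricting the search to instances of positive bags keeps the sketch short; a practical optimiser would more likely run a gradient-based search from several such starting points.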



2.2 Citation-kNN

The Citation-kNN algorithm [18] is a nearest neighbor style algorithm, which
borrows the notion of citation of scientific references in the way that a bag is
labeled through analyzing not only its neighboring bags but also the bags that
regard the concerned bag as a neighbor.

Nevertheless, it is evident that for any nearest neighbor style algorithm, the
key lies in the definition of the distance metric which is utilized to measure the
distance between different objects. Below we show that the key of Citation-kNN,
i.e. the definition of the minimal Hausdorff distance, can be achieved through
modifying the standard k-nearest neighbor algorithm according to the rule, i.e. shifting
the focus from discriminating the instances to discriminating the bags.

In the standard k-nearest neighbor algorithm, each object, or instance, is regarded
as a feature vector in the feature space. For two different feature vectors,
i.e. $a$ and $b$, the distance between them can be written as Eq. 9. Usually $\|a - b\|$
is realized as the Euclidean distance.

$$\mathrm{Dist}(a, b) = \|a - b\| \tag{9}$$

When the goal is to discriminate the instances, Eq. 9 is sufficient.
But if the goal is to discriminate the bags, then Eq. 9 must be extended
because now we should measure the distance between different bags.

Suppose we have two different bags, i.e. $A = \{a_1, a_2, \ldots, a_m\}$ and
$B = \{b_1, b_2, \ldots, b_n\}$, where $a_i$ ($1 \le i \le m$) and $b_j$ ($1 \le j \le n$) are the instances. It
is obvious that they can be regarded as two feature vector sets, where each $a_i$
or $b_j$ is a feature vector in the feature space. Therefore,
the problem of measuring the distance between different bags is in fact the
problem of measuring the distance between different feature vector sets.

Geometrically, a feature vector set can be viewed as a group of points enclosed 
in a contour in the feature space. Thus, an intuitive way to measure the distance 
between two feature vector sets is to define their distance as the distance between 
their nearest feature vectors, as illustrated in Fig. 1. 



[Figure: bag A and bag B in the feature space, with Dist(A, B) marked between their nearest instances]

Fig. 1. An intuitive way to define the distance between bags

Formally, such a distance metric can be written as Eq. 10.

$$\mathrm{Dist}(A, B) = \min_{\substack{1 \le i \le m \\ 1 \le j \le n}} \mathrm{Dist}(a_i, b_j) = \min_{a \in A} \min_{b \in B} \|a - b\| \tag{10}$$
It is interesting that Eq. 10 is neither more nor less than the formal definition
of the minimal Hausdorff distance, which is employed by the Citation-kNN
algorithm to measure the distance between different bags [18]!
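As a small illustration of Eq. 10, the distance between two bags can be computed from the matrix of pairwise instance distances. This is a minimal sketch assuming the Euclidean norm and bags stored as lists of equal-length feature vectors; the function name `min_hausdorff` is chosen here for illustration.

```python
import numpy as np


def min_hausdorff(bag_a, bag_b):
    """Minimal Hausdorff distance of Eq. 10 between two bags of vectors."""
    a = np.asarray(bag_a, dtype=float)  # shape (m, d)
    b = np.asarray(bag_b, dtype=float)  # shape (n, d)
    # Pairwise Euclidean distances between all a_i and b_j, shape (m, n).
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(pairwise.min())


bag_a = [[0.0, 0.0], [10.0, 10.0]]
bag_b = [[3.0, 4.0], [20.0, 20.0]]
print(min_hausdorff(bag_a, bag_b))  # nearest pair is (0, 0) and (3, 4)
```

Because only the closest pair of instances matters, a single instance of one bag lying near the other bag is enough to make the two bags neighbors.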

Note that although Wang and Zucker admitted that using the minimal Hausdorff
distance does allow the k-nearest neighbor algorithm to be adapted to multi-instance
learning, they also indicated that it is not sufficient [18]. This is because
the common prediction-generating scheme employed by k-nearest neighbor
algorithms, i.e. majority voting, might be confused by false positive instances in
positive bags in some cases. Therefore, as mentioned before, the notion of citation
and reference is introduced for obtaining the optimal performance.

However, it is evident that the utilization of the notion of citation and reference
does not change the fact that the minimal Hausdorff distance is the key
in adapting k-nearest neighbor algorithms to multi-instance learning. This is
because the notion of citation and reference can also be introduced to improve
the performance of k-nearest neighbor algorithms dealing with supervised learning
tasks. More importantly, a k-nearest neighbor algorithm employing common
distance metrics such as the Euclidean distance cannot work in multi-instance
scenarios, even if it is equipped with the notion of citation and reference,
while a k-nearest neighbor algorithm employing the minimal Hausdorff
distance can work in multi-instance scenarios, even if it does not take citation
and reference into account.

In fact, through analyzing the experimental data presented in the Appendix
of Wang and Zucker's paper [18], it can be found that when k is 3, the performance
of the k-nearest neighbor algorithm employing the minimal Hausdorff
distance without utilizing citation and reference is already comparable to or even
better than that of some multi-instance learning algorithms such as Relic [17]
and MULTINST [2] on Musk1, and RIPPER-MI [8] and GFS elim-count APR
[11] on Musk2. Moreover, if the fact that positive bags occur far less often
than negative bags is taken into account, so that a new bag is negatively
labeled when ties appear in determining its label, the performance of the
k-nearest neighbor algorithm employing the minimal Hausdorff distance without
utilizing citation and reference would be 90.2% on Musk1 and 82.4% on Musk2,
respectively, when k is 2. It is interesting that this reaches the best performance
of another multi-instance k-nearest neighbor algorithm, i.e. Bayesian-kNN, proposed
by Wang and Zucker [18].
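The plain bag-level k-nearest neighbor scheme discussed above, with ties resolved in favor of the negative label, can be sketched as follows. This is an illustrative reading of the text, not Wang and Zucker's code: it uses only references (no citers), and the function names and toy bags are assumptions.

```python
import numpy as np


def min_hausdorff(bag_a, bag_b):
    """Minimal Hausdorff distance between two bags (Euclidean norm assumed)."""
    a = np.asarray(bag_a, dtype=float)
    b = np.asarray(bag_b, dtype=float)
    return float(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min())


def knn_bag_label(query_bag, train_bags, train_labels, k=2):
    """Vote among the k nearest training bags; a tie yields a negative label."""
    dists = [min_hausdorff(query_bag, bag) for bag in train_bags]
    nearest = np.argsort(dists)[:k]
    positive_votes = sum(1 for idx in nearest if train_labels[idx])
    # A strict majority of positive votes is required, so ties go negative.
    return positive_votes * 2 > k


# Toy training bags with one 2-D instance each (hypothetical data).
train_bags = [[[0.0, 0.0]], [[1.0, 0.0]], [[10.0, 10.0]]]
train_labels = [True, False, True]
query = [[0.4, 0.0]]
print(knn_bag_label(query, train_bags, train_labels, k=2))
print(knn_bag_label(query, train_bags, train_labels, k=1))
```

With k = 2 the two nearest bags disagree, so the query is labeled negative; with k = 1 only the nearest (positive) bag votes.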

3 Multi-instance Ensemble 

Ensemble learning paradigms train multiple versions of a base learner to solve 
a problem. Since ensembles are usually more accurate than single learners, one 
of the most active areas of research in supervised learning has been to study 
paradigms for constructing good ensembles [10]. 

Since we have shown in Section 2 that many supervised learning algorithms
can be adapted to multi-instance learning, a natural and exciting idea is to see
whether ensemble learning paradigms can be used to enhance multi-instance
learners. Here we call an ensemble of multi-instance learners a multi-instance
ensemble.

During the past years, diverse ensemble learning algorithms have been devel- 
oped, such as Bagging [6], Arc-x4 [7], AdaBoost [12], MultiBoost [19], GASEN 
[22], etc. In this section, we use a relatively simple algorithm, i.e. Bagging, to 
build the multi-instance ensembles. 

Bagging employs bootstrap sampling to generate several training sets from
the original training set and then trains component learners, i.e. multiple versions
of the base learner, from each generated training set. The predictions of the
component learners are combined via majority voting. The Bagging algorithm
is shown in Table 1, where $T$ bootstrap samples $S_1, S_2, \ldots, S_T$ are generated
from the training set $S$, a component learner $L_t$ is trained from each $S_t$, and an
ensemble $L^*$ is built from $L_1, L_2, \ldots, L_T$ whose output is the class label receiving
the largest number of votes; $x$ is the input feature vector, and $Y$ is the set of class
labels.



Table 1. The Bagging algorithm

Input: training set S, base learner L, trials of bootstrap sampling T
Output: ensemble L*

Process:
    for t = 1 to T {
        S_t = bootstrap sample from S
        L_t = L(S_t)
    }
    L*(x) = arg max_{y ∈ Y} Σ_{t: L_t(x) = y} 1
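The procedure in Table 1 translates almost line by line into code. The sketch below is a generic illustration, assuming the training set is a list of (input, label) pairs and the base learner is a function that maps such a list to a predictor; in the multi-instance setting each input would be a whole bag. The helper `nearest_example_learner` and the scalar toy data are hypothetical stand-ins for a real multi-instance learner.

```python
import random
from collections import Counter


def bagging(train_set, base_learner, trials, seed=0):
    """Build a Bagging ensemble as in Table 1 and return its predictor L*."""
    rng = random.Random(seed)
    components = []
    for _ in range(trials):
        # S_t: bootstrap sample, drawn with replacement, same size as S.
        sample = [rng.choice(train_set) for _ in train_set]
        # L_t = L(S_t)
        components.append(base_learner(sample))

    def ensemble(x):
        # L*(x): the label that receives the largest number of votes.
        votes = Counter(component(x) for component in components)
        return votes.most_common(1)[0][0]

    return ensemble


def nearest_example_learner(sample):
    """Hypothetical base learner: 1-nearest neighbour on scalar inputs."""
    def predict(x):
        _, label = min(sample, key=lambda pair: abs(pair[0] - x))
        return label
    return predict


train_set = [(0.0, False), (0.1, False), (0.2, False),
             (10.0, True), (10.1, True), (10.2, True)]
ensemble = bagging(train_set, nearest_example_learner, trials=5)
print(ensemble(0.05), ensemble(10.05))
```

Using an odd number of trials with two class labels avoids tied votes; with an even number, the sketch would fall back on whichever tied label `Counter` happens to list first.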



We attempt to build multi-instance ensembles for four different base learners,
i.e. Iterated-discrim APR [11], Diverse Density [14], Citation-kNN [18], and
EM-DD [21]. The reason for choosing Diverse Density and Citation-kNN was
discussed in Section 2. Here we briefly explain why the other two algorithms are
chosen.

Iterated-discrim APR is the best Axis-Parallel Rectangle (abbreviated as 
APR) algorithm proposed by Dietterich et al. [11], which attempts to search 
for appropriate axis-parallel rectangles constructed by the conjunction of the 
features. Dietterich et al. [11] indicated that since the APR algorithms had been 
optimized to the Musk data, i.e. the only real-world multi-instance benchmark 
data until now, the performance of Iterated-discrim APR might be the upper 
bound of this benchmark test. 

EM-DD [21] is a recent development in multi-instance learning, which combines
the EM and Diverse Density algorithms. It converts the multi-instance
problem to a single-instance setting by using EM to estimate the instance which
is responsible for the label of the bag. The best performance on the real-world
multi-instance benchmark test until now, i.e. a predictive error rate as small as
3.2% on Musk1 and 4.0% on Musk2, is achieved by this algorithm [21]. Note
that the performance of EM-DD has already exceeded the upper bound of this
benchmark test anticipated by Dietterich et al. [11].

4 Experiments 

The experiments are performed on the Musk data, which is the only real-world 
benchmark test data for multi-instance learners at present. 

The Musk data were generated in Dietterich et al.’s research on drug activ- 
ity prediction [11]. Here each molecule is regarded as a bag, and its alternative 
low-energy shapes are regarded as the instances in the bag. A positive bag cor- 
responds to a molecule qualified to make a certain drug, that is, at least one 
of its low-energy shapes could tightly bind to the target area of some larger 
protein molecules such as enzymes and cell-surface receptors. A negative bag 
corresponds to a molecule not qualified to make a certain drug, that is, none of 
its low-energy shapes could tightly bind to the target area. 

In order to represent the shapes, a molecule is placed in a standard posi- 
tion and orientation and then a set of 162 rays emanating from the origin is 
constructed so that the molecular surface is sampled approximately uniformly. 
There are also four features that represent the position of an oxygen atom on
the molecular surface. Therefore each instance in the bags is represented by 166
continuous attributes.

There are two data sets, i.e. Musk1 and Musk2, both of which are publicly
available from the UCI Machine Learning Repository [4]. Musk1 contains 47
positive bags and 45 negative bags, and the number of instances contained in
each bag ranges from 2 to 40. Musk2 contains 39 positive bags and 63 negative
bags, and the number of instances contained in each bag ranges from 1 to 1,044.
Detailed information on the Musk data is tabulated in Table 2.



Table 2. The Musk data (72 molecules are shared in both data sets)

Data set   Dim.   Bags                       Instances   Instances per bag
                  Total   Musk   Non-musk                Min   Max     Ave.
Musk1      166    92      47     45          476         2     40      5.17
Musk2      166    102     39     63          6,598       1     1,044   64.69



Ten-fold cross validation is performed on each Musk data set. In each fold,
Bagging is employed to build an ensemble for each of the four base multi-instance
learners, i.e. Iterated-discrim APR, Diverse Density, Citation-kNN, and EM-DD.
Each ensemble comprises five versions of the base learner. The predictive error
rates of the ensembles are shown in Table 3. For comparison, the best results of
the single multi-instance learners reported in the literature [11] [14] [18] [21] are
also included in Table 3.
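The evaluation protocol splits the data at the level of bags, not instances. The sketch below shows one way such a ten-fold partition could be produced; the text does not say how the folds were drawn, so the random shuffling, the seed, and the function name `k_fold_bag_indices` are assumptions made for illustration.

```python
import random


def k_fold_bag_indices(n_bags, n_folds=10, seed=0):
    """Partition bag indices 0..n_bags-1 into n_folds disjoint test folds."""
    indices = list(range(n_bags))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]


n_bags = 92  # Musk1 contains 92 bags
folds = k_fold_bag_indices(n_bags)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(n_bags) if i not in held_out]
    # Here one would train a five-component Bagging ensemble on the bags in
    # train_idx and count its errors on the bags in test_idx.
print([len(fold) for fold in folds])
```

Keeping every instance of a bag in the same fold matters because the label belongs to the bag as a whole.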
Table 3. Predictive error rates (%) of single or ensembled multi-instance learners

Algorithm               Musk1                Musk2
                        Single   Ensemble    Single   Ensemble
Iterated-discrim APR    7.6      7.2         10.8     6.9
Diverse Density         11.1     8.2         17.5     11.0
Citation-kNN            7.6      5.2         13.7     12.9
EM-DD                   3.2      3.1         4.0      3.0



Table 3 shows that Bagging can significantly improve the generalization ability
of all the investigated multi-instance learners³. It is impressive that even the
strongest multi-instance learner, i.e. EM-DD, can be enhanced by such a relatively
simple ensemble learning algorithm. In fact, the EM-DD ensemble achieves
the best performance to date on both the Musk data sets, i.e. a predictive error
rate of 3.1% on Musk1 and 3.0% on Musk2.
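The size of each improvement can be read off Table 3 with simple arithmetic. The following sketch copies the error rates from the table and computes the absolute reduction (in percentage points) and the relative reduction for every learner and data set; the dictionary layout and function name are illustrative choices.

```python
# Error rates (%) copied from Table 3 as (single, ensemble) pairs.
TABLE3 = {
    ("Iterated-discrim APR", "Musk1"): (7.6, 7.2),
    ("Iterated-discrim APR", "Musk2"): (10.8, 6.9),
    ("Diverse Density", "Musk1"): (11.1, 8.2),
    ("Diverse Density", "Musk2"): (17.5, 11.0),
    ("Citation-kNN", "Musk1"): (7.6, 5.2),
    ("Citation-kNN", "Musk2"): (13.7, 12.9),
    ("EM-DD", "Musk1"): (3.2, 3.1),
    ("EM-DD", "Musk2"): (4.0, 3.0),
}


def reductions(table):
    """Absolute (points) and relative (%) error reduction for each entry."""
    result = {}
    for key, (single, ensemble) in table.items():
        absolute = single - ensemble
        result[key] = (absolute, 100.0 * absolute / single)
    return result


for (algorithm, data_set), (absolute, relative) in reductions(TABLE3).items():
    print(f"{algorithm:<22}{data_set}: {absolute:4.1f} points, "
          f"{relative:4.1f}% relative")
```

On these figures every ensemble improves on its single learner, with gains ranging from 0.1 points (EM-DD on Musk1) to 6.5 points (Diverse Density on Musk2).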

Since the process of building an ensemble of multi-instance learners is not
geared to any specific data, we believe that such a paradigm can be
applied to any multi-instance problem. It is also reasonable to anticipate that
such a paradigm may return more profit on difficult problems where no single
multi-instance learner works very well. Moreover, the experiments reported in
this section also suggest that ensemble learning paradigms be investigated in more
scenarios, not limited to supervised learning.

5 Conclusion 

When formalizing the notion of multi-instance learning, Dietterich et al. [11] 
raised an open problem, i.e. designing multiple instance modifications for popular 
machine learning algorithms. Although multi-instance versions of many machine 
learning algorithms have been developed in recent years, there is no general rule 
indicating how to do such a modification until now. 

This paper claims that many supervised learning algorithms can be adapted
to multi-instance learning through shifting their focus from the discrimination
on instances to the discrimination on bags. Although the concrete shift process
is dependent on the working mechanism of the supervised learning algorithm
concerned, the rule for adaptation is feasible and general enough to be applied to
diverse supervised learning algorithms. For example, this paper illustrates
how two famous multi-instance algorithms, i.e. Diverse Density and Citation-kNN,
can be derived from the standard Bayesian classifier and the k-nearest neighbor
algorithm, respectively, through shifting their focus.

Designing multi-instance learning algorithms with strong generalization ability
is always an important issue in this area. Considering that many supervised
learning algorithms can be adapted to multi-instance learning, and ensemble
learning paradigms can effectively enhance supervised learners, this paper proposes
to build multi-instance ensembles to solve multi-instance problems.

³ The results of the single multi-instance learners in Table 3 are the best results
reported by their authors [11][14][18][21]. In our implementation, the performance
of the single learners is slightly worse than these best results.

Experiments show that all the investigated multi-instance learners can be
enhanced by a relatively simple ensemble learning algorithm, and the best result
to date on the real-world benchmark test of multi-instance learners is achieved
by the EM-DD ensemble. The experiments not only support our claim that building
multi-instance ensembles is a good choice for solving multi-instance problems,
but also suggest that ensemble learning paradigms be investigated in more scenarios,
not limited to supervised learning.